Tuesday, April 13, 2010

Bloginar: Standardizing Bioinformatics with BioHDF (HDF5)

Yesterday we (The HDF Group and Geospiza) released the BioHDF prototype software.  To mark the occasion, and demonstrate some of BioHDF’s capabilities and advantages, I share the poster we presented at this year’s AGBT (Advances in Genome Biology and Technology) conference.

The following map guides the presentation. The poster has a title and four main sections, which cover background information, specific aspects of the general Next Generation Sequencing (NGS) workflow, and HDF5’s advantages for working with large amounts of NGS data.
 
Section 1.  The first section introduces HDF5 (Hierarchical Data Format) as a software platform for working with scientific data.  The introduction begins with the abstract and lists five specific challenges created by NGS: 1) high-end computing infrastructures are needed to work with NGS data, 2) NGS data analysis involves complex multi-step processes, 3) these processes compare NGS data to multiple reference sequence databases, 4) the resulting datasets of alignments must be visualized in multiple ways, and 5) scientific knowledge is gained when many datasets are compared.

Next, choices for managing NGS data are compared in a four category table.  These include text and binary formats. While text formats (delimited and XML) have been popular for bioinformatics, they do not scale well and binary formats are gaining in popularity. The current bioinformatics binary formats are listed (bottom left) along with a description of their limitations. 

The introduction closes with a description of HDF5 and its advantages for supporting NGS data management and analysis. Specifically, HDF5 is a platform for managing scientific data. Such data are typically complex and consist of images, large multi-dimensional arrays, and metadata. HDF5 has been used for over 20 years in other data-intensive fields; it is robust, portable, and tuned for high performance computing. Thus HDF5 is well suited for NGS. Indeed, groups ranging from academic researchers to NGS instrument vendors and software companies are recognizing the value of HDF5.

Section 2. This section illustrates how HDF5 facilitates primary data analysis. First we are reminded that NGS data are analyzed in three phases: primary analysis, secondary analysis and tertiary analysis. Primary analysis is the step that converts images to reads consisting of basecalls (or colors, or flowgrams), and quality values. In secondary analysis, reads are aligned to reference data (mapped) or amongst themselves (assembled). In many NGS assays, secondary analysis produces tables of alignments that must be compared to one another, in tertiary analysis, to gain scientific insights.

The remaining portion of section 2 shows how Illumina GA and SOLiD primary data (reads and quality values) can be stored in BioHDF and later reviewed using the BioHDF tools and scripts.  The resulting quality graphs are organized into three groups (left to right) to show base composition plots, quality value (QV) distribution graphs, and other summaries.

Base composition plots show the count of each base (or color) that occurs at a given position in the read. These plots are used to assess overall randomness of a library and observe systematic nucleotide incorporation errors or biases.

Quality value plots show the distribution of QVs at each base position within the ensemble of reads. As each NGS run produces many millions of reads, it is worthwhile summarizing QVs in multiple ways. The first plots, from the top, show the average QV per base with error bars indicating QVs that are within one standard deviation of the mean. Next, box and whisker plots show the overall quality distribution (median, lower and upper quartile, minimum and maximum values) at each position. These plots are followed by “error” plots which show the total count of QVs below certain thresholds (red, QV < 10; green, QV < 20; blue, QV < 30). The final two sets of plots show the number of QVs at each position for all observed values and the number of bases having each quality value.
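
The per-position summaries described above can be sketched in a few lines. This is only an illustration under simple assumptions: each read's QVs arrive as a list of integers, and the name `qv_summary` is hypothetical rather than part of the BioHDF tools.

```python
from statistics import mean, median, pstdev


def qv_summary(quality_rows, thresholds=(10, 20, 30)):
    """Summarize quality values (QVs) position by position across all reads."""
    length = max((len(row) for row in quality_rows), default=0)
    summary = []
    for pos in range(length):
        # Collect the QV at this position from every read long enough to have one.
        column = [row[pos] for row in quality_rows if pos < len(row)]
        summary.append({
            "mean": mean(column),
            "stdev": pstdev(column),
            "median": median(column),
            "min": min(column),
            "max": max(column),
            # Counts behind the "error" plots: QVs below each threshold.
            "below": {t: sum(1 for q in column if q < t) for t in thresholds},
        })
    return summary


# Made-up QVs for three reads of length three.
quality_rows = [[30, 25, 8], [32, 18, 12], [28, 22, 5]]
summary = qv_summary(quality_rows)
for pos, stats in enumerate(summary, start=1):
    print(pos, stats)
```

The lower and upper quartiles for the box and whisker plots could be added with `statistics.quantiles`, which needs at least two values per position.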

The final group of plots shows overall dataset complexity, GC content (base space only), average QV/read, and %GC vs average QV (base space only).  Dataset complexity is computed by determining the number of times a given read exactly matches other reads in the dataset. In some experiments, too many identical reads indicate a problem like PCR bias. In other cases, like tag profiling, many identical reads are expected from highly expressed genes. Errors in the data can artificially increase complexity, because a read with a miscalled base no longer exactly matches its duplicates.
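
As a rough sketch of the complexity and GC content calculations just described, assuming reads are plain base-space strings (the function names and example reads are made up, and the real tools may compute these differently):

```python
from collections import Counter


def duplication_histogram(reads):
    """Map 'number of copies' to 'number of distinct read sequences'."""
    copies_per_sequence = Counter(reads)
    return Counter(copies_per_sequence.values())


def gc_fraction(read):
    """Fraction of G and C bases in a base-space read."""
    if not read:
        return 0.0
    return sum(1 for base in read.upper() if base in "GC") / len(read)


# Made-up example: one sequence seen three times, two seen once.
reads = ["ACGT", "ACGT", "GGCC", "ACGT", "TTAA"]
histogram = duplication_histogram(reads)
unique_fraction = len(set(reads)) / len(reads)

print(dict(histogram))
print(unique_fraction)
print([gc_fraction(read) for read in reads])
```

Because the comparison is an exact match, changing a single base in one of the three "ACGT" copies would turn it into a new unique sequence, which is how errors inflate apparent complexity.
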

Section 3.  Primary data analysis gives us a picture of how well the samples were prepared or how well the instrument ran with some indication about sample quality. Secondary and tertiary analysis tell us about sample quality and, more importantly, provide biological insights. The third section focuses on secondary and tertiary analysis and begins with a brief cartoon showing a high level data analysis workflow using BioHDF to store primary data, alignment results, and annotations. BioHDF tools are used to query these data and other software within GeneSifter is used to compare data between samples and display the data in interactive reports to examine the details from single or multiple samples.

The left side of this section illustrates what is possible with single samples. Beginning with a simple table that indicates how many reads align to each reference sequence, we can drill into multiple reports that provide increasing detail about the alignments. For example, the gene list report (second from top) uses gene model annotations to summarize the alignments for all genes identified in the dataset. Each gene is displayed as a thumbnail graphic that can be clicked to see greater detail, which is shown in the third plot. The Integrated Gene View not only shows the density of reads across the gene's genomic region, but also shows evidence of splice junctions, and identified single base differences (SNVs) and small insertions and deletions (indels). Navigation controls provide ways to zoom into and out of the current view of data, and move to new locations. Additionally, when possible, the read density plot is accompanied by an Entrez gene model and dbSNP data so that data can be observed in a context of known information. Tables that describe the observed variants follow. Clicking on a variant drills into the alignment viewer to show the reads encompassing the point of variation.

The right side illustrates multi-sample analysis in GeneSifter. In assays like RNA-Seq, alignment tables are converted to gene expression values that can be compared between samples. Volcano (top) and other plots are used to visualize the differences between the datasets. Since each point in the volcano plot represents the difference in expression for a gene between two samples (or conditions), we can click on that point to view the expression details for that gene (middle) in the different samples. In the case of RNA-Seq, we can also obtain expression values for the individual exons within the gene, making it possible to observe differential exon levels in conjunction with overall gene expression levels (middle). Clicking the appropriate link in the exon expression bar graph takes us to the alignment details for the samples being analyzed (bottom); in this example we have two cases and two control replicates. Like the single sample Integrated Gene Views, annotations are displayed with alignment data. When navigation buttons are clicked all of the displayed genes move together so that you can explore the gene's details and surrounding neighborhood for multiple samples in a comparative fashion.
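
For readers unfamiliar with volcano plots, each point conventionally places a gene's log2 fold change on one axis and the negative log10 of a p-value on the other. The sketch below only computes those coordinates. How GeneSifter derives expression values and p-values is not described here, so the inputs, the pseudocount, and the name `volcano_point` are all assumptions made for illustration.

```python
import math


def volcano_point(expr_a, expr_b, p_value, pseudocount=1.0):
    """Return (log2 fold change, -log10 p) for one gene.

    The pseudocount is an assumption that avoids dividing by zero
    when a gene has no expression in one sample.
    """
    log2_fc = math.log2((expr_b + pseudocount) / (expr_a + pseudocount))
    return log2_fc, -math.log10(p_value)


# Made-up values: (expression in sample A, expression in sample B, p-value).
genes = {
    "geneA": (7.0, 31.0, 0.001),
    "geneB": (15.0, 15.0, 0.5),
}
points = {name: volcano_point(*values) for name, values in genes.items()}
for name, (x, y) in points.items():
    print(name, round(x, 3), round(y, 3))
```

Genes that change a lot and do so reliably end up far from the center and high on the plot, which is why they are the natural points to click on.
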

Section 4.  The poster closes with details about BioHDF.  First, the data model is described. An advantage of the BioHDF model is that read data are organized non-redundantly. Other formats, like BAM, tend to store reads with alignments and if a read has multiple alignments in a genome, or is aligned to multiple reference sequences, it gets stored multiple times. This may seem trivial, but anything that can happen a million times becomes noticeable. This fact is demonstrated in the table listed in the second panel “High Performance Computing Advantages.”  Other HDF5 advantages are listed below the performance stats table.  Most notable is HDF5’s ability to easily support multiple indexing schemes like nested containment lists (NClists). NClists solve the problem of efficiently accessing reads from alignments that may be contained in other alignments, which I will save for a later post.
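
The storage argument can be illustrated with a toy model. This is not the actual BioHDF or BAM layout, just a sketch that assumes reads are kept in one table keyed by a made-up read ID, with alignments referring to reads by that ID.

```python
from collections import Counter

# Each read (sequence plus quality values) is stored exactly once.
reads = {
    "r1": ("ACGTACGT", [30, 30, 28, 27, 30, 29, 25, 20]),
    "r2": ("TTGACCAA", [31, 30, 30, 22, 18, 30, 30, 12]),
}

# Alignments refer to reads by ID: (read_id, reference, start position).
alignments = [
    ("r1", "chr1", 100),
    ("r1", "chr2", 5000),
    ("r1", "chrM", 42),
    ("r2", "chr1", 250),
]


def bases_stored_redundantly(reads, alignments):
    """Bases written if every alignment record carries its own copy of the read."""
    return sum(len(reads[read_id][0]) for read_id, _, _ in alignments)


def bases_stored_once(reads):
    """Bases written if each read is stored once and alignments hold only IDs."""
    return sum(len(sequence) for sequence, _ in reads.values())


# The same alignment table also gives the "reads per reference" summary.
reads_per_reference = Counter(reference for _, reference, _ in alignments)

print(bases_stored_redundantly(reads, alignments), bases_stored_once(reads))
print(dict(reads_per_reference))
```

With only two reads the difference is small, but it grows with every additional alignment of the same read, which is the effect the performance table on the poster is meant to show.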

Finally, the poster is summarized with a number of take home points. These reiterate the fact that NGS is driving the need to use binary file formats to manage NGS data and analysis results and that HDF5 provides an attractive solution because of its long history and development efforts that specifically target scientific programming requirements. In our hands, HDF5 has helped make GeneSifter a highly scalable and interactive web-application with less development effort than would have been needed to implement other technologies.

If you are a software developer and are interested in BioHDF, please visit www.biohdf.org.  If you do not want to program and instead want a way to easily analyze your NGS data to make new discoveries, please contact us.

Sunday, March 14, 2010

Keeping Your DNA Sequencing, Genotyping, and Microarray Laboratory Competitive in a New Era of Genomics

ABRF 2010 is next week. The conference will be in sunny Sacramento, CA. About 1000 technology geeks will convene to learn about the latest advances in DNA sequencing, genotyping, and proteomics instrumentation, lab protocols, and core lab services. We will be there with our booth and will take part with LIMS and NGS data analysis presentations.

The first presentation, entitled "Keeping Your DNA Sequencing, Genotyping, and Microarray Laboratory Competitive in a New Era of Genomics," will be on Sunday Mar. 20 in the second concurrent workshop (w2) at 1:00 pm.

Abstract

Laboratory directors are facing enormous challenges with respect to keeping their laboratories competitive and retaining customers in the face of shrinking budgets and rapidly changing technology. A well-designed Laboratory Information Management System (LIMS) can help meet these challenges and manage costs as the scale and complexity of data collection and related services increase. LIMS can also offer competitive advantages through increased automation and improved customer experiences.

Implementing a LIMS strategy that will reduce data collection costs while improving competitiveness is a daunting proposition. LIMS are computerized data and information tracking systems that are highly variable with respect to their purpose, customization capabilities, and overall acquisition (initial purchase) and ownership (maintenance) costs. A simple LIMS can be built from a small number of spreadsheets and track a few specific processes. Sophisticated LIMS rely on databases to manage multiple laboratory processes, capture and analyze different kinds of data, and provide decision support capabilities.

In this presentation, I will share 20 years of academic and industrial LIMS experiences and perspectives that have been informed through hundreds of interactions with core, research, and manufacturing laboratories engaged in DNA sequencing, genotyping, and microarrays. We’ll explore the issues that need to be addressed with respect to either building a LIMS or acquiring a LIMS product. A new model that allows laboratories to offer competitive services, utilizing cost-effective laboratory automation strategies and new technologies like next generation sequencing, will be presented. We’ll also compare different IT infrastructures and discuss their advantages and how investments can be made to protect against unexpected costs as new instruments, like the HiSeq 2000(TM) or SOLiD 4(TM), third generation sequencing, or other genetic analysis platforms are introduced.

Wednesday, February 17, 2010

Standardizing the Next Generation of Bioinformatics Software Development With BioHDF (HDF5)

AGBT is next week, and we'll be there presenting a poster on our latest and greatest work with HDF5 and BioHDF tools. For those of you attending, check out the poster. For those unable to attend, check back later for the "Bloginar."

Abstract

Next Generation Sequencing technologies are powerful tools for rapidly sequencing genomes and studying functional genomics. However, the lack of scalable data analysis capabilities limits their potential. Future bioinformatics applications need to be developed on common standard infrastructures that can reduce overall data storage, increase data processing performance, integrate information from multiple sources and are self-describing. HDF technologies meet all of these requirements, have a long history, and are widely used in data-intensive science communities. They consist of general data file formats, software libraries and tools for manipulating the data. Compared to emerging standards such as the SAM/BAM formats, HDF5-based systems demonstrate improved I/O performance and methods to reduce data storage. HDF5 is also more extensible and can support multiple data indexes and store multiple data types. For these reasons, HDF5 and its BioHDF implementation are well qualified as standards for implementing data models in binary formats to support the next generation of bioinformatics applications.

In the poster we will present:
  1. An overview of NGS data analysis and workflows
  2. A prototype data model for working with NGS data
  3. Practical examples of data analysis and viewing information using the underlying framework
  4. Performance benchmarks comparing HDF5 to other file formats 

Wednesday, February 3, 2010

Sneak Peek: Data Analysis Methods for Whole Transcriptome Sequencing Applications – Challenges and Solutions

RNA sequencing is one of the most popular Next Generation Sequencing (NGS) applications. Next Thursday, February 11 at 10:00 A.M. PST (1:00 P.M. EST), we kick off our 2010 webinar series with a presentation designed to help you understand whole transcriptome data analysis and what can be learned in these experiments. In addition, we will show off some of our latest tools and interfaces that can be used to discover new RNAs, new splice forms of transcripts, and alleles of expressed genes.

Summary

RNA sequencing applications such as Whole Transcriptome Analysis, Tag Profiling and Small RNA Analysis allow whole genome analysis of coding as well as non-coding RNA at an unprecedented level. Current technologies allow for the generation of 500 million data points in a single instrument run. In addition to allowing for the complete characterization of all known RNAs in a sample (gene level expression summaries, exon usage, splice junction, single nucleotide variants, insertions and deletions), these applications are also ideal for the identification of novel RNAs as well as novel splicing events.

This presentation will provide an overview of Whole Transcriptome data analysis workflows with emphasis on calculating gene and exon level expression values as well as identifying splice junctions and variants from short read data. Comparisons of multiple groups to identify differential gene expression as well as differential splicing will also be discussed. Using data drawn from the GEO data repository and Short Read Archive (SRA), analysis examples will be presented for both Illumina’s GA and Lifetech’s SOLiD instruments.

Register Today!

Monday, January 25, 2010

Grant Opportunities for Next Generation DNA Sequencing

As we close the first month of 2010, it is time to get your pencils sharpened and submit proposals for new shared instruments.

The National Center for Research Resources (NCRR) announced that it has $43M to fund equipment purchases in 2011. With this money, NCRR expects to make approximately 125 new awards for instruments that cost at least $100,000 but less than $600,000. NCRR proposals are due March 23, 2010.

In addition to NCRR, the National Science Foundation (NSF), through its Major Research Instrumentation (MRI) program, has $90M to make 150 awards of between $100,000 and $4M for shared instrumentation. MRI proposals are due April 21, 2010.

Remember, when preparing proposals, a sound informatics plan will make your application stand out. Contact us if you’d like more information.

Monday, January 18, 2010

Systems Biology with HDF5

As many are aware, Geospiza and The HDF Group are collaborating to extend HDF (Hierarchical Data Format) technologies to support the data management needs of high performance computing applications in genomics. As we do this work, others are also adopting HDF5 as a data storage technology to work with different kinds of biological data.

The Association for Computing Machinery (ACM) recently published an article, "Unifying Biological Image Formats with HDF5," that argues for using HDF5 and HDF tools as a common framework for working with image files. This article is worth reading for several reasons.

First, it provides a nice introduction and background to HDF5, its origins, and movement towards becoming an ISO standard. HDF5's technical features are also included in this discussion.

Next, a brief history of the imaging community is covered to share how X-ray crystallographers and electron and optical microscopists had all independently considered HDF5 as a framework for their next-generation image file formats. Through this discussion, the challenges that have been identified within the imaging community are listed.

As in genomics, the amounts of data being collected are ever increasing, current formats are inflexible and difficult to adapt to future modalities and dimensionality, and the nonarchival quality of data undermines long-term value. That is, current data typically lack sufficient metadata about their origins and experiments to be useful in the long term.

The article goes on to make the point that current challenges with image data could be addressed if the community adopts an existing format that can support both generic and specialized data formats and meet a set of common requirements related to performance, interoperability, and archiving. Examples of how HDF5 meets these requirements are included. Briefly, HDF5's data caching can be used to overcome computation bottlenecks related to the fact that image sizes are exceeding RAM capacity. Interoperability issues can be addressed through HDF5's ability to store multiple metadata schemas in flexible ways. And, because HDF5 is self describing, data stored in HDF5 can be better preserved.

Finally, a barrier to moving to a new technology is supporting legacy applications that may be costly to replace. Thus, the article closes with a creative proposal for supporting legacy software applications and recommendations for future development. HDF5 files could support legacy software applications if they were able to present the data, stored within the HDF5 file, as the collection of directories and files required by the legacy application. This could be accomplished by developing an abstraction layer that could interact with FUSE (Filesystem in User Space) and essentially mount the HDF5 file as a virtual file system. Such a scenario is only possible because data are stored in HDF5 in a general way that can be further abstracted and presented in multiple specific ways.
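
To picture what such an abstraction layer would do, here is a small sketch. It uses neither FUSE nor the HDF5 library; a nested Python dictionary stands in for the groups and datasets of a hierarchical file, and the function name `virtual_paths` and the example contents are invented for illustration.

```python
def virtual_paths(node, prefix=""):
    """Yield (path, is_directory) pairs for a nested group/dataset tree.

    Dictionaries play the role of groups (shown as directories) and
    anything else plays the role of a dataset (shown as a file).
    """
    for name, child in sorted(node.items()):
        path = f"{prefix}/{name}"
        if isinstance(child, dict):
            yield path, True
            yield from virtual_paths(child, path)
        else:
            yield path, False


# Made-up container contents standing in for an image file collection.
container = {
    "experiment1": {
        "image_0001": b"pixel bytes",
        "metadata": {"microscope": b"example"},
    },
}
listing = list(virtual_paths(container))
for path, is_directory in listing:
    print("dir " if is_directory else "file", path)
```

A real implementation would answer the file system calls a legacy application makes (list, open, read) from the same kind of walk, so the application sees ordinary directories and files.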

While this article focused on issues related to image formats, there are many parallels that the genomics and Next Generation Sequencing communities should pay attention to, and if you are a bioinformatics software developer or running bioinformatics projects, you should put this paper on your must read list.

Wednesday, January 13, 2010

2010 sequencing starts in style

Next Generation Sequencing (NGS) is a hot topic. As we kick off 2010, many themes continue. Data throughput is increasing, sequencing costs are decreasing, and NGS still requires extensive informatics support.

Throughput up, costs down

As sequencing throughput increases, the costs for collecting sequencing data decrease. Illumina is setting the pace for 2010 by announcing its latest sequencing instrument, the HiSeq2000. Illumina’s press release, news reports, and the blogosphere enthusiastically report on the instrument’s five fold increase in data throughput and ability to sequence an entire human genome in about one week for about $10,000.

What about the informatics?

This month’s reviews and editorials in Nature Reviews Genetics (NRG) and Nature Biotechnology (NBT), respectively, claim that the most significant NGS challenge continues to be dealing with the data. As pointed out in the NRG editorial, it is quite possible that the community will produce more sequence data this year than has been cumulatively produced in the past 10 years. The HiSeq, developments that will be announced by Applied Biosystems in February, and the coming single molecule sequencers support this. The editorial further makes the point that genome centers have the computing infrastructure to deal with the data, but the larger community of researchers, who could benefit from these technologies, do not. A similar observation was made at the end of the NBT review which pointed out that costs associated with downstream handling and processing of the data will possibly equal or exceed data collection costs.

The significance of the informatics challenge is that wide adoption of NGS technologies assumes that we have usable solutions for working with the data. These solutions go beyond simply getting a computer cluster with a sequencing instrument. To be useful, that cluster needs to reside in an adequately air conditioned room, be operated by people who know how to work with cluster hardware and software and can also optimize networks to manage the flow of data. Other individuals are needed who can write programs and scripts to process the data, work with multiple database technologies, and develop scalable user interfaces to visualize and navigate through the results and compare information between multiple samples and experiments.

The conversation about the informatics problem began with the introduction of NGS technologies. In 2008, Nature Methods (July) and NBT (October) published editorials speaking to the coming challenges. Later in 2009, Science published a new article about data intensive science. Previous FinchTalks have discussed the articles and their significance and the theme has remained the same; both the access to computing technologies and the skills needed to use the data are unavailable to the large numbers of researchers who need to use these technologies to remain competitive.

There is a solution

One solution to the informatics challenge created by NGS, and other data intensive technologies, is to make use of the immense Internet-based computing infrastructure that has been created by companies like Amazon, Google, Yahoo, and others. Also called Cloud Computing, Internet-based services remove many of the hardware and infrastructure barriers for utilizing high performance computing and storage technology. This message was delivered by the 2010 NBT kick off editorial and accompanying news feature, along with the next important message that software solutions also need to be adapted to cloud environments. Here the editorial, like many other descriptions of NGS informatics needs, falls short in that it focuses only on alignment programs. Simply adapting alignment algorithms using technologies like Hadoop to employ Cloud-based high performance computing clusters is not a sufficient solution.

Aligning billions of reads to reference data quickly and accurately is clearly important. However it is just the first step of a complex analysis process. The subsequent steps of analyzing the billions of alignments to filter artifacts, identify true and new variation between sequences, discover alternative splice forms in transcripts, and compare data between samples are even more challenging.

Fortunately, Geospiza understands the problem well. As our tag line, From Samples to Results(TM), suggests, our lab and analysis systems focus on solving a complete set of problems that need to be addressed in order to do good science with NGS and other genetic analysis technologies.

Perhaps this is why we were the only software provider discussed in the NBT news feature, “Up in a cloud.”

Sunday, December 6, 2009

Expeditiously Exponential: Genome Standards in a New Era

One of the hot topics of 2009 has been the exponential growth in genomics and other data and how this growth will impact data use and sharing. The journal Science explored these issues in its policy forum in Oct. In early November, I discussed the first article, which was devoted to sharing data and data standards. The second article, listed under the category “Genomics,” focuses on how genomic standards need to evolve with new sequencing technologies.

Drafting By

The premise of the article “Genome Project Standards in a New Era of Sequencing” was to begin a conversation about how to define standards for sequence data quality in this new era of ultra-high throughput DNA sequencing. One of the “easy” things to do with Next Generation Sequencing (NGS) technologies is create draft genome sequences. A draft genomic sequence is defined as a collection of contig sequences that result from one, or a few, assemblies of large numbers of smaller DNA sequences called reads. In traditional Sanger sequencing a read was between 400 and 800 bases in length and came from a single clone, or sub-clone of a large DNA fragment. NGS reads come from individual molecules in a DNA library and vary between 36 and 800 bases in length depending on the sequencing platform being used (454, Illumina, SOLiD, or Helicos).

A single NGS run can now produce enough data to create a draft assembly for many kinds of organisms with smaller genomes such as viruses, bacteria, and fungi. This makes it possible to create many draft genomes quickly and inexpensively. Indeed the article was accompanied by a figure showing that the current growth of draft sequences exceeds the growth of finished sequences by a significant amount. If this trend continues, the ratio of draft to finished sequences will grow exponentially into the foreseeable future.

Drafty Standards

The primary purpose for a complete genome sequence is to serve as a reference for other kinds of experiments. A well annotated reference is accompanied by a catalog of genes and their functions, as well as an ordering of the genes, regulatory regions, and the sequences needed for evolutionary comparisons that further elucidate genomic structure and function. A problem with draft sequences is that they can contain large numbers of errors that range from incorrect base calls to more problematic mis-assemblies that place bases or groups of bases in the wrong order. Because these holes leave some sequences more drafty than others, those sequences are less useful in fulfilling their purpose as reference data.

If we can describe the draftiness of a genome sequence we may be able to weigh its fitness for a specific purpose. The article went on to tackle this problem by recommending a series of qualitative descriptions that describe levels of draft sequences. Beginning with the Standard Draft, or an assembly of contigs of unfiltered data from one or more sequencing platforms, the terms move through High-Quality Draft, to Improved High-Quality Draft, to Annotation-Directed Improvement, to Noncontiguous Finished, to Finished. Finished sequence is defined as less than 1 error per 100,000 bases and each genomic unit (chromosomes or plasmids that are capable of replication) is assembled into a single contig with a minimal number of exceptions. The individuals proposing these standards are a well respected group in the genome community and are working with the database groups and sequence ontology groups to incorporate these new descriptions into data submissions and annotations for data that may be used by others.

Given the high cost and lengthy time required to finish genomic sequences, finishing every genome to a high standard is impractical. If we are going to work with genomes that are finished to varying degrees, systematic ways to describe the quality of the data are needed. These policy recommendations are a good start, but more needs to be done to make the proposed standards useful.

First, standards need to be quantitative. Qualitative descriptions are less useful because they create downstream challenges when reference data are used in automated data processing and interpretation pipelines. As the numbers of available genomes grow into the thousands and tens of thousands, subjective standards make the data more and more cumbersome and difficult to review. Moreover without quantitative assessment, how will one know when they have an average error rate of 1 in 100,000 bases? The authors intentionally avoided recommending numeric thresholds in the proposed standards because the instrumentation and sequencing methodologies are changing rapidly. This may be true, but future discussions nevertheless should focus on quantitative descriptions for that very reason. It is because data collection methods and instrumentation are changing rapidly that we need measures we can compare. This is the new world.
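
As one illustration of what a quantitative check might look like, the sketch below estimates an average error rate from per-base quality values and compares it with the "less than 1 error per 100,000 bases" definition of Finished. It assumes that Phred-scaled consensus quality values (QV = -10 log10 of the error probability) are available for every base, which the article does not specify; the function names are made up.

```python
def error_probability(qv):
    """Phred-scaled quality value -> probability that the base call is wrong."""
    return 10 ** (-qv / 10)


def expected_errors(quality_values):
    """Expected number of wrong bases, summed over all positions."""
    return sum(error_probability(qv) for qv in quality_values)


def meets_finished_threshold(quality_values, max_rate=1 / 100_000):
    """True if the estimated average error rate is below max_rate."""
    return expected_errors(quality_values) / len(quality_values) < max_rate


# Made-up consensus qualities for two tiny example sequences.
high_quality = [60] * 1000   # about 1 expected error per 1,000,000 bases
lower_quality = [40] * 1000  # about 1 expected error per 10,000 bases

print(meets_finished_threshold(high_quality))
print(meets_finished_threshold(lower_quality))
```

Such an estimate is only as good as the calibration of the quality values and says nothing about mis-assemblies, but it is the kind of number an automated pipeline could compare across genomes.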

Second, the article fails to address how the different standards might be applied in a practical sense. For example, what can I expect to do with a finished genome that I cannot do with a nearly finished genome? What is a standard draft useful for? How should I trust my results and what might I expect to do to verify a finding? While the article does a good job describing the quality attributes of the data that genome centers might produce, the proposed standards would have broader impact if they could more specifically set expectations of what could be done with data.

Without this understanding, we still won't know when our data are good enough.

Sunday, November 22, 2009

Supercomputing 09

Teraflops, exaflops, exabytes, exascale, extreme, high dimensionality, 3D Internet, sustainability, green power, high performance computation, 400 Gbps networks, and massive storage were just some of the many buzz words seen and heard last week at the 21st annual supercomputing conference in Portland, Oregon.

Supercomputing technologies and applications are important to Geospiza. As biology becomes more data intensive, Geospiza follows the latest science and technology developments by attending conferences like Supercomputing. This year, we participated in the conference through a "birds of a feather" session focused on sharing recent progress in the BioHDF project.

Each year the Supercomputing (SC) conference has focus areas called "thrusts." This year the thrusts were 3D Internet, Biocomputing, and Sustainability. Each day of the technical session started with a keynote presentation that focused on one of the thrusts. Highlights from the keynotes are discussed below.

First thrust: the 3D Internet

The technical program kicked off with a keynote from Justin Rattner, VP and CTO at Intel. In his address, Rattner discussed the business reality that high performance computing (HPC) is an $8 billion business with little annual growth (3% AGR). The primary sources for HPC funding are government and trickle up technology from PC sales. To break the dependence on government funding, Rattner suggested that HPC needs a "killer app" and suggested that the 3D Internet might just be that app. He went on to elaborate on the kinds of problems, such as continuously simulating environments, realistic animation, dynamic modeling and continuous perspectives, that are solved with HPC. Also, because immersive and collaborative virtual environments can be created, the 3D Internet provides a platform for enabling many kinds of novel work.

To illustrate, Rattner was joined by Aaron Duffy, a researcher at Utah State. Rather, Duffy’s avatar joined us as his presentation was in the Science SIM environment. Science SIM is a virtual reality system that is used to model environments and how they respond to change. For example, Utah State is studying how ferns respond to and propagate genetic changes in FernLand. Another example included how 3D modeling can save time and materials in fashion design.

Next, Rattner described how the current 3D Internet resembles the early days of the Internet, when people struggled with the isolated networks of AOL, Prodigy and CompuServe. It wasn't until Tim Berners-Lee and Marc Andreessen introduced the World Wide Web HTTP protocol and the Mosaic web browser that the Internet had a platform on which to standardize. Similarly, the 3D Internet needs such a platform. Rattner introduced OpenSim as a possibility. In the OpenSim platform, extensible modules can be used to create different worlds. Because these worlds are built with a common infrastructure, users could have an avatar that could move between worlds, rather than have a new avatar for each world as they do today.

Second thrust: biocomputing

Leroy Hood kicked off the second day with a keynote on how supercomputing can be applied to systems biology and personalized medicine. Hood expects that within 10 years diagnostic assays will be characterized by billions of measurements. We will have two primary kinds of information feeding these assays: the digital data of the organism and data from the environment. The challenge is measuring how the environment affects the organism. To make this work we need to integrate biology, technology, and computers in better ways than we do today.

In terms of personalized medicine, Hood described different kinds of analyses and their purpose. For example, global analysis - such as sequencing a genome, measuring gene expression, or comprehensive protein analysis - creates catalogs. These catalogs then form the foundation for future data collection and analysis. The goal of such analysis is to create predictive, actionable models. Biological data, however, are noisy, and meaningful signals can be difficult to detect, so improving the signal-to-noise ratio requires the ability to integrate large volumes of multi-scalar data with diverse data types, including biological knowledge. Because the goal is to develop predictive, actionable models, we need supercomputers capable of dynamically quantifying information.

As an example, Hood presented work showing how disease states result in perturbations in regulated networks. In prion disease, the expression of many genes changes over time as non-disease states move toward disease states. Interestingly, as disease progression is followed in mouse models, one can see expression levels change in genes that were not thought to be involved in prion disease. More importantly, these genes show expression changes before the physiological effects are observed. In other words, by observing gene expression patterns, one can detect a disease much earlier than by observing symptoms. Because diseases detected early are easier to treat, early detection can have beneficial consequences for reducing health care costs. However, measuring gene expression changes by observing changes in RNA levels is currently impractical. The logical next step is to see if gene expression can be measured by detecting changes in the levels of blood proteins. Of course, Hood and team are doing that too, and he showed data from the prion model indicating that this is a feasible approach.

Using the above example, and others from whole genome sequencing, Hood painted a picture of future diagnostics where we will have our genomes sequenced at birth and each of us will have a profile created of organ specific proteins. In Hood's world, this would require 50 measurements from 50 organs. Blood protein profiles will be used as controls in regular diagnostic assays. In other assays, like cancer diagnostics, 1000’s of individual transcriptomes will be measured simultaneously in single assays. Similarly, 10,000 B-cells or T-cells could be sequenced to assess immune states and diagnose autoimmune disorders. In the not too distant future, it will be possible to interrogate databases containing billions of data points from 100's of millions of patients.

With these possibilities on the horizon, there are a number of challenges that must be overcome. Data space is infinite, so queries must be constructed carefully. The data that need to be analyzed have high dimensionality, so we need new ways to work with these data. Finally, multi-scale datasets must be integrated together and data analysis systems must be interoperable. Meeting these last challenges requires that standards for working with data be developed and adopted. In closing, Hood made the point that groups like his can solve some of the scientific issues related to computation, but not the infrastructure issues that must also be solved to make the vision a reality.

Fortunately, Geospiza is investigating technologies to meet current and future biocomputing challenges through the company’s product development and standards initiatives like the BioHDF project.

Third thrust: sustainability

Al Gore gave the third day’s keynote address and much of his talk addressed climate change. Gore reminded us that 400 years ago, Galileo collected the data that supported Copernicus’ theory that the earth’s rotation creates the illusion of the sun moving across the sky. He went on to explain how Copernicus reasoned that the illusion is created because the sun is so far away. Gore also explained how difficult it was for people of Copernicus', or Galileo’s, time to accept that the universe does not rotate around the earth.

Similarly, when we look into the sky we see an expansive atmosphere that seems to go on forever. Pictures from space, however, tell a different story. Those pictures show us that our atmosphere is a thin band, only 1/1000 the size of the earth’s volume. The finite volume of our atmosphere explains how we can change our climate when we pump billions of tons of CO2 into the atmosphere as we are doing now. It is also hard for many to conceptualize that the CO2 is affecting the climate when they do not see or feel direct or immediate effects. Gore added the interesting connection that the first oil well, drilled by “Colonel” Edwin Drake in Pennsylvania, and the discovery, by John Tyndall, that CO2 absorbs infrared radiation both occurred in 1859. 150 years ago we not only had the means to create climate change, but understood how it would work.

Gore outlined a number of ways in which supercomputing and the supercomputing community can help with global warming. Climate modeling and climate prediction are two important areas where supercomputers are used. Conference presentations and demonstrations on the exhibit floor made this clear. Less obvious applications involve modeling new electrical grids and more efficient modes of transportation. Many of the things we rely on daily are based on infrastructures that are close to 100 years old. From our internal combustion engines to our centralized electrical systems, inefficiency can be measured in billions of dollars that are lost annually to system failures or energy consumption that is not effective.

Gore went on to remind us that Moore’s law is a law of self-fulfilling expectations. When first proposed, it was a recognition of design and implementation capabilities with an eye to the future. Moore’s law worked because R&D funding was established to stay on track. We now have an estimated one billion transistors for every person on the planet. If we commit similar efforts to improving energy efficiency in ways analogous to Moore’s law, we can create a new self-fulfilling paradigm. The benefits of such a commitment would be significant. As Gore pointed out, our energy, climate, and economic crises are intertwined. Much of our national policy is in reaction to oil production disruption or the threat of disruption, and the costs of our policies are significant.

In closing, Gore stated that supercomputing is the most powerful technology we have today and represents the third form of knowledge creation, the first two being inductive and deductive reasoning. With supercomputers we can collect massive amounts of data, develop models and use simulation to develop predictive and testable hypotheses. Gore noted that humans have a low bit rate, but high resolution. This means that while our ability to absorb data is slow, we are very good at recognizing patterns. Thus computers, with their ability to store and organize data, can be programmed to convert data into information and display information in new ways to give us new insights for solutions to the most vexing problems.

This last point resonated through all three keynotes. Computers are giving us new ways to work with data and understand problems; they are also providing new ways to share information and communicate with each other.

Geospiza is keenly aware of this potential and a significant focus of our research and development is directed toward solving data analysis, visualization, and data sharing problems in genomics and genetic analysis. In the area of Next Generation Sequencing (NGS), we have been developing new ways to organize and visualize the information contained in NGS datasets to easily spot patterns amidst the noise.


Sunday, November 8, 2009

Expeditiously Exponential: Data Sharing and Standardization

We can all agree that our ability to produce genomics and other kinds of data is increasing at exponential rates. Less clear are the consequences for how these data will be shared and ultimately used. These topics were explored in last month's (Oct. 9, 2009) policy forum feature in the journal Science.

The first article, listed under the category "megascience," dealt with issues about sharing 'omics data. The challenge is that systems biology research demands that data from many kinds of instrument platforms (DNA sequencing, mass spectrometry, flow cytometry, microscopy, and others) be combined in different ways to produce a complete picture of a biological system. Today, each platform generates its own kind of "big" data that, to be useful, must be computationally processed and transformed into standard outputs. Moreover, the data are often collected by different research groups focused on particular aspects of a common problem. Hence, the full utility of the data being produced can only be realized when the data are made open and shared throughout the scientific community. The article listed past efforts in developing sharing policies and the central table included 12 data sharing policies that are already in effect.

Sharing data solves half of the problem; the other half is being able to use the data once shared. This requires that data be structured and annotated in ways that make them understandable by a wide range of research groups. Such standards typically include minimum information checklists that define specific annotations and which data should be kept from different platforms. The data and metadata are stored in structured documents that reflect a community's view about what is important to know with respect to how data were collected and the samples the data were collected from. The problem is that annotation standards are developed by diverse groups and, like the data, are expanding. This expansion creates new challenges with making data interoperable, the very problem standards try to address.

The article closed with high-level recommendations for enforcing policy through funding and publication requirements and acknowledged that full compliance requires that general concerns with pre-publication data use and patient information be addressed. More importantly, the article acknowledged that meeting data sharing and formatting standards has economic implications. That is, researchers need time-efficient data management systems, the right kinds of tools and informatics expertise to meet standards. We also need to develop the right kind of global infrastructure to support data sharing.

Fortunately, complying with data standards is an area where Geospiza can help. First, our software systems rely on open, scientifically valid tools and technologies. In DNA sequencing we support community developed alignment algorithms. The statistical analysis tools in GeneSifter Analysis Edition utilize R and BioConductor to compare gene expression data from both microarrays and DNA sequencing. Further, we participate in the community by contributing additional open-source tools and standards through efforts like the BioHDF project. Second, the GeneSifter Analysis and Laboratory platforms provide the time-efficient data management solutions needed to move data through its complete life cycle from collection, to intermediate analysis, to publishing files in standard formats.

GeneSifter lowers researchers' economic barriers to meeting data sharing and annotation standards, keeping the focus on doing good science with the data.


Friday, October 23, 2009

Yardsticks and Sequencers

A recent question to the ABRF discussion forum, about quality values and Helicos data, led to an interesting conversation about having yardsticks to compare between Next Generation Sequencing (NGS) platforms and the common assays that are run on those platforms.

It also got me thinking, just how well can you measure things with those free wooden yardsticks you get at hardware stores and home shows?

Background

The conversation started with a question asking about what kind of quality scoring system could be applied to Helicos data. Could something similar to Phred and AB files be used?

A couple of answers were provided. One referred to the recent Helicos article in Nature Biotechnology and pointed out that Helicos has such a method. This answer also addressed the issue that quality values (QVs) need to be tuned for each kind of instrument.

Another answer, from a core lab director with a Helicos instrument, pointed out many more challenges that exist with comparing data from different applications and how software in this area is lacking. He used the metaphor of the yardstick to make the point that researchers need systematic tools and methods to compare data and platforms.

What's in a Yardstick?

I replied to the thread noting that we've been working with data from 454, Illumina GA, SOLiD and Helicos and there are multiple issues that need to be addressed in developing yardsticks to compare data from different instruments for different experiments (or applications).

At one level, there is the instrument and the data that are produced, and the question is: can we have a standard quality measure? With Phred, we need to recall that each instrument needed to be calibrated so that quality values would be useful and equivalent across chemistries and platforms (primers, terminators, bigdye, gel, cap, AB models, MegaBACE ...). Remember phredpar.dat? Because the data were of a common type - an electropherogram - we could more or less use a single tool and define a standard. Even then, other tools (LifeTrace, KB basecaller, and LongTrace) emerged and computed standardized quality values differently. So, I would argue that we think we have a measure, but it is not the standard we think it is.

By analogy, each NGS instrument uses a very different method to generate sequences, so each platform will have a unique error profile. The good news is that quality values, as transformed error probabilities, make it possible to compare output from different instruments in terms of confidence. The bad news is that if you do not know how the error probability is computed, or you do not have enough data (control, test) to calibrate the system, error probabilities are not useful. Add to that the fact that the platforms are undergoing rapid change as vendors improve chemistry and change hardware and software to increase throughput and accuracy. So, for the time being we might have yardsticks, but they have variable lengths.

The next levels deal with experiments. As noted, ChIP-Seq, RNA-Seq, Me-Seq, Re-Seq, and your favorite-Seq all measure different things, and we are just learning about how errors and other artifacts interfere with how well the data produced actually measure what the experiment intended to measure. Experiment level methods need to be developed so that ChIP-Seq from one platform can be compared to ChIP-Seq from another platform and so on. However, the situation is not dire because in the end, DNA sequences are the final output and for many purposes the data produced are much better now than they have been in the past. As we push sensitivity, the issues already discussed become very relevant.

As a last point, the goal many researchers will have is to layer data from one experiment on another experiment, correlating ChIP-Seq with RNA-Seq for example. To do that you not only need quality measures for data, sample, and experiment, you also need ways to integrate all of this experimental information with already published data. There is a significant software challenge ahead and, as pointed out, cobbling solutions together is not a feasible long term answer. The datasets are getting too big and complex, and at the same time the archives are bursting with data generated by others.

So what does this have to do with yardsticks?

Back to yardsticks. Those cheap wooden yardsticks expand and contract with temperature and humidity, so at different times a yardstick's measurements will change. This change is the uncertainty of the measurement (see additional reading below), which defines the precision of our measuring device. If I want a quick estimate of how tall my dog stands, I would happily use the wooden yardstick. However, if I want to measure something to within a 32nd of an inch or a millimeter, I would use a different tool. The same rules apply to DNA sequencing: for many purposes the reads are good enough and data redundancy overcomes errors, but as we push sensitivity and want to measure changes in fewer molecules, discussions about how to compute QVs and annotate data, so that we know which measuring device was used, become very important.

Finally, I often see in the literature and company brochures, and hear in conversation, QVs referred to as Phred scores. Remember: Only Phred makes Phred QVs - everything else is Phred-like, but only if it is a -10 log10(P) transformation of an error probability.
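To make the transformation concrete, here is a minimal sketch of a Phred-like conversion between an error probability P and a quality value. The function names are hypothetical and chosen only for illustration; this is not Phred's code, and it says nothing about how any instrument estimates P in the first place, which is the calibration problem discussed above.

```python
import math


def error_prob_to_qv(p):
    """Return a Phred-like quality value, QV = -10 * log10(P).

    p is the estimated probability that a base call is wrong.
    (Hypothetical helper for illustration only.)
    """
    if not 0.0 < p <= 1.0:
        raise ValueError("error probability must be in (0, 1]")
    return -10.0 * math.log10(p)


def qv_to_error_prob(qv):
    """Invert the transformation: P = 10 ** (-QV / 10)."""
    return 10.0 ** (-qv / 10.0)


if __name__ == "__main__":
    # Print the QV for a few example error probabilities.
    for p in (0.1, 0.01, 0.001):
        print(p, round(error_prob_to_qv(p), 1))
```

Under this transformation an error probability of 1 in 1,000 corresponds to a QV of 30. Two instruments reporting the same QV are only comparable if both estimate P in a calibrated way.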

Additional Reading:

Color Space, Flow Space, Sequence Space, or Outer Space: Part I. Uncertainty in DNA Sequencing

Color Space, Flow Space, Sequence Space or Outer Space: Part II, Uncertainty in Next Gen Data


Tuesday, October 13, 2009

Super Computing 09 and BioHDF

Next month, Nov 16-20, we will be in Portland for Super Computing 09 - SC09. Join us at a Birds of a Feather (BoF) session to learn about developing bioinformatics applications with BioHDF. The session will be Wed. Nov 18 at 12:15 pm in room D139-140.

Developing Bioinformatics Applications with BioHDF

In this session we will present how HDF5 can be used to work with large volumes of DNA sequence data. We will cover the current state of bioinformatics tools that utilize HDF5 and proposed extensions to the HDF5 library to create BioHDF. The session will include a discussion of requirements that are being considered to develop data models for working with DNA sequence alignments to measure variation within sets of DNA sequences.

HDF5 is an open-source technology suite for managing diverse, complex, high-volume data in heterogeneous computing and storage environments. The BioHDF project is investigating the use of HDF5 for working with very large scientific datasets. HDF5 provides a hierarchical data model, binary file format, and collection of APIs supporting data access. BioHDF will extend HDF5 to support DNA sequencing requirements.

Initial prototyping of BioHDF has demonstrated clear benefits. Data can be compressed and indexed in BioHDF to reduce storage needs and enable very rapid (typically, few millisecond) random access into these sequence and alignment datasets, essentially independent of the overall HDF5 file size. Through additional prototyping activities we have identified key architectural elements and tools that will form BioHDF.

The BoF session will include a presentation of the current state of BioHDF and proposed implementations to encourage discussion of future directions.


Tuesday, October 6, 2009

From Blue Gene to Blue Genome? Big Blue Jumps In with DNA Transistors

Today, IBM announced that they are getting into the DNA sequencing business and the race for the $1,000 genome by winning a research grant to explore new sequencing technology based on nanopore devices they call DNA transistors.

IBM news travels fast. Genome Web and The Daily Scan covered the high points and Genetic Future presented a skeptical analysis of the news. You can read the original news at the IBM site, and enjoy a couple of videos.

A NY Times article listed a couple of facts that I thought were interesting: First, IBM becomes the 17th company to pursue the next next-gen (or third-generation) technology. Second, according to George Church, in the past five years the cost of collecting DNA sequence data has decreased by 10-fold annually and is expected to continue decreasing at a similar pace for the next few years.

But what does this all mean?

It is clear from this and other news that DNA sequencing is fast becoming a standard way to study genomes, gene expression, and measure genetic variation. It is also clear that while the cost of DNA sequencing is decreasing at a fast rate, the amount of data being produced is increasing at a similarly fast rate.

While some of the articles above discussed the technical hurdles nanopore sequencing must overcome, none discussed the real challenges researchers face today with using the data. The fact is, for most groups, the current next-gen sequencers are underutilized because the volumes of data combined with the complexity of data analysis have created a significant bioinformatics bottleneck.

Fortunately, Geospiza is clearing data analysis barriers by delivering access to systems that provide standard ways of working with the data and visualizing results. For many NGS applications, groups can upload their data to our servers, align reads to reference data sources, and compare the resulting output across multiple samples in efficient and cost effective processes.

And, because we are meeting the data analysis challenges for all of the current NGS platforms, we'll be ready for whatever comes next.


Wednesday, September 23, 2009

GeneSifter in Current Protocols

This month we are pleased to report Geospiza's publication of the first standard protocols for analyzing Next Generation Sequencing (NGS) data. The publication, appearing in the September issue of Current Protocols, addresses how to analyze data from both microarray and NGS experiments. The abstract and links to the paper and our press release are provided below.

Abstract

Transcription profiling with microarrays has become a standard procedure for comparing the levels of gene expression between pairs of samples, or multiple samples following different experimental treatments. New technologies, collectively known as next-generation DNA sequencing methods, are also starting to be used for transcriptome analysis. These technologies, with their low background, large capacity for data collection, and dynamic range, provide a powerful and complementary tool to the assays that formerly relied on microarrays. In this chapter, we describe two protocols for working with microarray data from pairs of samples and samples treated with multiple conditions, and discuss alternative protocols for carrying out similar analyses with next-generation DNA sequencing data from two different instrument platforms (Illumina GA and Applied Biosystems SOLiD).

In the chapter we cover the following protocols:
  • Basic Protocol 1: Comparing Gene Expression from Paired Sample Data Obtained from Microarray Experiments
  • Alternate Protocol 1: Compare Gene Expression from Paired Samples Obtained from Transcriptome Profiling Assays by Next-Generation DNA Sequencing
  • Basic Protocol 2: Comparing Gene Expression from Microarray Experiments with Multiple Conditions
  • Alternate Protocol 2: Compare Gene Expression from Next-Generation DNA Sequencing Data Obtained from Multiple Conditions

Links

To view the abstract, contents, figures, and literature cited online visit: Curr. Protoc. Bioinform. 27:7.14.1-7.14.34

To view the press release visit: Geospiza Team Publishes First Standard Protocol for Next Gen Data Analysis


Saturday, September 12, 2009

Sneak Peek: Sequencing the Transcriptome: RNA Applications for Next Generation Sequencing

Join us this coming Wednesday, September 16, 2009 10:00 am Pacific Daylight Time (San Francisco, GMT-07:00), for a webinar on whole transcriptome analysis. In the presentation you will learn about how GeneSifter Analysis Edition can be used to identify novel RNAs and novel splice events within known RNAs.

Abstract:

Next Generation Sequencing applications such as RNA-Seq, Tag Profiling, Whole Transcriptome Sequencing and Small RNA Analysis allow whole genome analysis of coding as well as non-coding RNA at an unprecedented level. Current technologies allow for the generation of 200 million data points in a single instrument run. In addition to allowing for the complete characterization of all known RNAs in a sample, these applications are also ideal for the identification of novel RNAs and novel splicing events for known RNAs.

This presentation will provide an overview of the RNA applications using data from the NCBI's GEO database and Short Read Archive with an emphasis on converting raw data into biologically meaningful datasets. Data analysis examples will focus on methods for identifying differentially expressed genes, novel genes, differential splicing and 5’ and 3’ variation in miRNAs.

To register, please visit the event page.


Sunday, September 6, 2009

Open or Closed

A key aspect of Geospiza’s software development and design strategy is to incorporate open scientific technologies into the GeneSifter products to deliver user friendly access to best-of-breed tools used to manage and analyze genetic data from DNA sequencing, microarray, and other experiments.

Open scientific technologies include open-source and published academic algorithms, programs, databases, and core infrastructure software such as operating systems, web servers, and other components needed to build modern systems for data management. Unlike approaches that rely on proprietary software, Geospiza’s adoption of open platforms and participation in the open-source community benefits our customers in numerous ways.

Geospiza’s Open Source History

When Geospiza began in 1997, the company started building software systems to support DNA sequencing technologies and applications. Our first products focused on web-enabled data management for DNA sequencing-based genomics applications. Foundational infrastructure, such as the web-server, and application layer incorporated Apache and Perl. We were also leaders, in that our first systems operated on Linux, an open-source UNIX-based operating system. In those early days, however, we used proprietary databases such as Solid and Oracle because the open-source alternatives Postgres and MySQL were still lacking features needed to support robust data processing environments. As these products matured, we extended our application support to include Postgres to deliver cost-effective solutions for our customers. By adopting such open platforms we were able to deliver robust, high performing systems, rapidly at a reasonable cost.

In addition to using open-source technology as the foundation of our infrastructure, we also worked with open tools to deliver our scientific applications. Our first product, the Finch Blast-Server, utilized the public domain BLAST from NCBI. Where possible, we sought to include well-adopted tools for other applications such as base calling, sequence assembly, and repeat masking, for which the source code was made available. We favored these kinds of tools over developing our own proprietary tools, because it was clear that technologies emerging from communities like the genome centers would advance much more quickly and be better tuned to the problems people were trying to address. Further, these tools, because of their wide adoption within their community and publication, received higher levels of scrutiny and validation than their proprietary counterparts.

Times Change

In the early days, many of the genome center tools were licensed by universities. As the bioinformatics field matured, open-source models for delivering bioinformatics software have become more popular. Led by NCBI and pioneered by organizations like TIGR (now JCVI) and the Sanger Institute, the majority of useful bioinformatics programs are now being delivered open-source under GPL, BSD-like, or Perl Artistic style licenses (www.opensource.org). The authors of these programs have benefited from wider adoption of their programs and continued support from funding agencies like NIH. In some cases other groups are extending best-of-breed technologies into new applications.

A significant benefit of the increasing use of open-source licensing is that a large number of analytical tools are readily available for many kinds of applications. Today we have robust statistical platforms like R and BioConductor and several algorithms for aligning Next Gen Sequencing (NGS) data. Because these platforms and tools are truly open-source, bioinformatics groups can easily access these technologies to understand how they work and compare other approaches to their own. This creates a competitive environment for bioinformatics tool providers that drives improvements in algorithm performance and accuracy and the research community benefits greatly.

Design Choices

Early on, Geospiza recognized the value of incorporating tools from the academic research community into our user friendly software systems. Such tools were being developed in close collaboration with the data production centers that were trying to solve scientific problems associated with DNA sequence assembly and analysis. Companies developing proprietary tools designed to compete with these efforts were at a disadvantage, because they did not have real time access to conversations between biologists, lab specialists, and mathematicians needed to quickly develop the deep experience of working with biologically complex data. This disadvantage continues today. Further, the closed nature of proprietary software limits the ability to publish work and have critical peer review of the code needed to ensure scientific validation.

Our work could proceed more quickly because we did not have to invest in solving the research problems associated with developing algorithms. Moreover, we did not have to invest in proving the scientific credibility of an algorithm. Instead we could cite published references and keep our focus on solving problems associated with delivering the user interfaces needed to work with the data. Our customers benefited by gaining easy access to best-of-breed tools and having the knowledge that they had a community to draw on to understand their scientific basis.

Geospiza continues its practice of adopting open best-of-breed technologies. Our NGS systems utilize multiple tools such as MAQ, BWA, Bowtie, MapReads and others. GeneSifter Analysis Edition utilizes routines from the R and BioConductor package to perform statistical computations to compare datasets from microarray and NGS experiments. In addition, we are addressing issues related to high performance computing through our collaboration with the HDF Group and the BioHDF project. In this case we are not only adopting open-source technology, but also working with leaders in the field to make open-source contributions of our own.

When you use Geospiza’s GeneSifter products, you can be assured that you are using the same tools as the leaders in our field. You benefit from reduced data analysis costs combined with the advantages of community support through forums and peer-reviewed literature.


Sunday, August 16, 2009

BioHDF on the Web

During the past spring and early part of summer, we presented our initial work using HDF5 technology to make next generation DNA sequencing data management scalable. The presentations are posted on the web, along with other points of interest that are listed below.

Presentations:
Presentations by Mark Welsh and me can be found at SciVee.
Mark presents our poster at ISMB, and I present our work at the “Sequencing, Finishing and Analysis in the Future Meeting,” in Santa Fe.

We also presented at this and last year’s BOSC meetings that were held at ISMB. The abstracts and slides can be found at:

What others are thinking:
Real time commentary on the 2009 BOSC presentation can be found at friendfeed and another post. The Fisheye Perspective considers how HDF5 fits with semantic web tools.

HDF in Bioinformatics:
Check out Fast5 for using HDF5 to store sequences and base probabilities.

BioHDF in the News:
Genome Web and Bioinform articles on HDF5 or referencing HDF5 include:

FinchTalk:
Links to FinchTalks about BioHDF from 2008 to present include:
Bloginar: Scalable Bioinformatics Infrastructures with BioHDF.

HDF Software:
You can learn more about HDF5 and get the software and tools at:


Sunday, July 12, 2009

Bloginar: Scalable Bioinformatics Infrastructures with BioHDF. Part V: Why HDF5?

Through the course of this BioHDF bloginar series, we have demonstrated how the HDF5 (Hierarchical Data Format) platform can successfully meet current and future data management challenges posed by Next Generation Sequencing (NGS) technologies. We now close the series by discussing the reasons why we chose HDF5.

For previous posts, see:

  1. The introduction
  2. Project background
  3. Challenges of working with NGS data
  4. HDF5 benefits for working with NGS data

Why HDF5?

As previously discussed, HDF technology is designed for working with large amounts of complex data that naturally organize into multidimensional arrays. These data are composed of discrete numeric values, strings of characters, images, documents, and other kinds of data that must be compared in different ways to extract scientific information and meaning. Software applications that work with such data must meet a variety of organization, computation, and performance requirements to support the communities of researchers where they are used.

When software developers build applications for the scientific community, they must decide between creating new file formats and software tools for working with the data, or adapting existing solutions that already meet a general set of requirements. The advantage of developing software that's specific to an application domain is that highly optimized systems can be created. However, this advantage can disappear when significant amounts of development time are needed to deal with the "low-level" functions of structuring files, indexing data, tracking bits and bytes, making the system portable across different computer architectures, and creating a basic set of tools to work with the data. Moreover, such a system would be unique, with only a small set of users and developers able to understand and share knowledge concerning its use.

The alternative to building a highly optimized domain-specific application system is to find and adapt existing technologies, with a preference for those that are widely used. Such systems benefit from the insights and perspective of many users and will often have features in place before one even realizes they are needed. If a technology has widespread adoption, there will likely be a support group and knowledge base to learn from. Finally, it is best to choose a solution that has been tested by time. Longevity is a good measure of the robustness of the various parts and tools in the system.

HDF: 20 Years in Physical Sciences

Our requirements for a high-performance data management and computation system are these:

  1. Different kinds of data need to be stored and accessed.
  2. The system must be able to organize data in different ways.
  3. Data will be stored in different combinations.
  4. Visualization and computational tools will access data quickly and randomly.
  5. Data storage must be scalable, efficient, and portable across computer platforms.
  6. The data model must be self describing and accessible to software tools.
  7. Software used to work with the data must be robust, and widely used.

HDF5 is a natural fit. The file format and software libraries are used in some of the largest data management projects known to date. Because of its strengths, HDF5 is independently finding its way into other bioinformatics applications and is a good choice for developing software to support NGS.

HDF5 software provides a common infrastructure that allows different scientific communities to build specific tools and applications. Applications using HDF5 typically contain three parts: one or more HDF5 files to store data, a library of software routines to access the data, and the tools, applications and additional libraries to carry out functions that are specific to a particular domain. To implement an HDF5-based application, a data model must be developed along with application-specific tools such as user interfaces and unique visualizations. While implementation can be a lot of work in its own right, the tools to implement the model and provide scalable, high-performance programmatic access to the data have already been developed, debugged, and delivered through the HDF I/O (input/output) library.

In earlier posts, we presented examples where we needed to write software to parse fasta-formatted sequence files and output files from alignment programs. These parsers then called routines in the HDF I/O library to add data to the HDF5 file. During the import phase, we could set different compression levels and define the chunk size to compress our data and optimize access times. In these cases, we developed a simple data model based on the alignment output from programs like BWA, Bowtie, and MapReads. Most importantly, we were able to work with NGS data from multiple platforms efficiently, with software that required weeks of development rather than the months and years that would be needed if the system were built from scratch.
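To make the import step concrete, here is a minimal sketch of the general idea: parse fasta records, group the reads into fixed-size chunks, and compress each chunk on its own so that one chunk can be read back without touching the rest. This is not our importer and it does not use the HDF I/O library. It relies only on the Python standard library (zlib) as a stand-in, and the function names, chunk size, and sample reads are assumptions made up for the illustration.

```python
import io
import zlib


def parse_fasta(handle):
    """Yield (name, sequence) pairs from fasta-formatted text."""
    name, parts = None, []
    for line in handle:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if name is not None:
                yield name, "".join(parts)
            # Keep only the first word of the header as the read name.
            name, parts = line[1:].split()[0], []
        else:
            parts.append(line)
    if name is not None:
        yield name, "".join(parts)


def store_in_chunks(records, chunk_size=2, level=6):
    """Compress reads in independent chunks of `chunk_size` reads each.

    Hypothetical stand-in for chunked, compressed storage: returns a
    list of compressed byte strings, one per chunk.
    """
    chunks, current = [], []
    for _name, seq in records:
        current.append(seq)
        if len(current) == chunk_size:
            chunks.append(zlib.compress("\n".join(current).encode("ascii"), level))
            current = []
    if current:
        chunks.append(zlib.compress("\n".join(current).encode("ascii"), level))
    return chunks


def read_chunk(chunks, index):
    """Decompress a single chunk and return its reads."""
    return zlib.decompress(chunks[index]).decode("ascii").split("\n")


# Made-up example data.
FASTA_TEXT = """>read1 sample=A
ACGTACGT
>read2
TTGACCAA
>read3
GGGTTTAA
"""

records = list(parse_fasta(io.StringIO(FASTA_TEXT)))
chunks = store_in_chunks(records, chunk_size=2)
second_chunk = read_chunk(chunks, 1)  # only this chunk is decompressed
```

The compression level and the chunk size are the two knobs mentioned above: larger chunks usually compress better, while smaller chunks mean less data is decompressed for each random access.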

While HDF5 technology is powerful "out-of-the-box," a number of features can still be added to make it better for bioinformatics applications. The BioHDF project is about making such domain-specific extensions. These are expected to include modifications to the general file format to better support variable data like DNA sequences. I/O library extensions will be created to help HDF5 "speak" bioinformatics by creating APIs (Application Programming Interfaces) that understand our data. Finally, sets of command line programs and other tools will be created to help bioinformatics groups get started quickly with using the technology.

To summarize, the HDF5 platform is well-suited for supporting NGS data management and analysis applications. Using this technology, groups will be able to make their data more portable for sharing because the data model and data storage are separated from the implementation of the model in the application system. HDF5's flexibility in the kinds of data it can store makes it easier to integrate data from a wide variety of sources. Integrated compression utilities and data chunking make HDF5-based systems as scalable as they can be. Finally, because the HDF5 I/O library is extensive and robust, and the HDF5 tool kit includes basic command-line and GUI tools, a platform is provided that allows for rapid prototyping and reduced development time, thus making it easier to create new approaches for NGS data management and analysis.

For more information, or if you are interested in collaborating on the BioHDF project, please feel free to contact me (todd at geospiza.com).


Monday, July 6, 2009

Bloginar: Scalable Bioinformatics Infrastructures with BioHDF. Part IV: HDF5 Benefits

Now that we're back from Alaska and done with the 4th of July fireworks, it's time to present the next installment of our series on BioHDF.

HDF highlights
HDF technology is designed for working with large amounts of scientific data and is well suited for Next Generation Sequencing (NGS). Scientific data are characterized by very large datasets that contain discrete numeric values, images, and other data, collected over time from different samples and locations. These data naturally organize into multidimensional arrays. To obtain scientific information and knowledge, we combine these complex datasets in different ways and (or) compare them to other data using multiple computational tools. One difficulty that plagues this work is that the software applications and systems for organizing the data, comparing datasets, and visualizing the results are complicated, resource intensive, and challenging to develop. Many of the development and implementation challenges can be overcome using the HDF5 file format and software library.

Previous posts have covered:
1. An introduction
2. Background of the project
3. Complexities of NGS data analysis and performance advantages offered by the HDF platform.

HDF5 changes how we approach NGS data analysis.

As previously discussed, the NGS data analysis workflow can be broken into three phases. In the first phase (primary analysis) images are converted into short strings of bases. Next, the bases, represented individually or encoded as di-bases (SOLiD), are aligned to reference sequences (secondary analysis) to create derivative data types such as contigs or annotated tables of alignments, that are further analyzed (tertiary analysis) in comparative ways. Quantitative analysis applications, like gene expression, compare the results of secondary analyses between individual samples to measure gene expression and identify mRNA isoforms, or make other observations based on a sample’s origin or treatment.

The alignment phase of the data analysis workflow creates the core information. During this phase, reads are aligned to multiple kinds of reference data to understand sample and data quality, and obtain biological information. The general approach is to align reads to sets of sequence libraries (reference data). Each library contains a set of sequences that are annotated and organized to provide specific information.

Quality control measures can be added at this point. One way to measure data quality is to ask how many reads were obtained from constructs without inserts. Aligning the read data to a set of primers (individually and joined in different ways) that were used in the experiment allows us to measure the number of reads that match and how well they match. A higher quality dataset will have a larger proportion of sequences matching our sample and a smaller proportion of sequences that only match the primers. Similarly, different biological questions can be asked using libraries constructed of sequences that have biological meaning.

Aligning reads to sequence libraries is the easy part. The challenge is analyzing the alignments. Because the read datasets in NGS assays are large, organizing alignment data into forms we can query is hard. The problem is simplified by setting up multistage alignment processes as a set of filters. That is, reads that match one library are excluded from the next alignment. Differential questions are then asked by counting the numbers of reads that match each library. With this approach, each set of alignments is independent of the other alignments and a program only needs to analyze one set of alignments at a time. Filter-based alignment is also used to distinguish reads with perfect matches from those with one or more mismatches.
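The filter idea can be sketched in a few lines. The example below is only an illustration: exact substring matching stands in for a real alignment program, and the library names, reference sequences, and reads are invented.

```python
def matches(read, library):
    """Toy stand-in for an aligner: a read hits if it is a substring
    of any reference sequence in the library."""
    return any(read in ref for ref in library.values())


def filter_pipeline(reads, libraries):
    """Align reads to each library in order.

    Reads that hit a library are counted for that library and removed,
    so later stages only see the reads that are still unmatched.
    """
    counts = {}
    remaining = list(reads)
    for lib_name, library in libraries:
        hits = [r for r in remaining if matches(r, library)]
        counts[lib_name] = len(hits)
        remaining = [r for r in remaining if not matches(r, library)]
    counts["unmatched"] = len(remaining)
    return counts


# Hypothetical libraries, listed in the order the filters are applied.
LIBRARIES = [
    ("primers", {"adapter": "AAAACCCC"}),
    ("rRNA", {"18S": "GGGTTTACGT"}),
    ("genome", {"chr1": "TTTACGTGGGACCA"}),
]
READS = ["AACC", "TACG", "GGAC", "CCCC", "GTGT"]

counts = filter_pipeline(READS, LIBRARIES)
# counts == {"primers": 2, "rRNA": 1, "genome": 1, "unmatched": 1}
```

Note that the read "TACG" also occurs in the toy genome sequence, but it is counted only under rRNA because it was removed before the genome stage was run.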

Still, filter-based alignment approaches have several problems. When new sequence libraries are introduced, the entire multistage alignment process must be repeated to update results. Next, information about reads that have multiple matches in different libraries, or both perfect and non-perfect matches within a library, is lost. Finally, because alignment formats between programs differ and good methods for organizing alignment data do not exist, it is hard to compare alignments between multiple samples. This last issue also creates challenges for linking alignments to the original sequence data and formatting information for other tools.

As previously noted, solving the above problems requires that alignment data be organized in ways that facilitate computation. HDF5 provides the foundation to organize and store both read and alignment data to enable different kinds of data comparisons. This ability is demonstrated by the following two examples.

In the first example (left), reads from different sequencing platforms (SOLiD, Illumina, 454) were stored in HDF5. Illumina RNA-Seq reads from three different samples were aligned to the human genome and annotations from a UCSC GFF (General Feature Format) file were applied to define gene boundaries. The example shows the alignment data organized into three HDF5 files, one per sample, but in reality the data could have been stored in a single file or files organized in other ways. One of HDF's strengths is that the HDF5 I/O library can query multiple files as if they were a single file, providing the ability to create the high-level data organizations that are the most appropriate for a particular application or use case. With reads and alignments structured in these files, it is a simple matter to integrate data to view base (color) compositions for reads from different sequencing platforms, compare alternative splicing between samples, and select a subset of alignments from a specific genomic region, or gene, in a "wig" format for viewing in a tool like the UCSC genome browser.
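As a rough illustration of the last step, the sketch below turns a handful of alignments into per-base read depth for one region and writes it as a fixedStep "wig" block. It is a sketch under stated assumptions, not the code behind the poster: the alignments are plain tuples rather than records read from HDF5, coordinates are assumed to be 1-based and inclusive, and the function name and data are invented.

```python
def coverage_wig(alignments, chrom, region_start, region_end):
    """Return fixedStep wig text with read depth for one region.

    alignments: iterable of (chrom, start, end) tuples, 1-based inclusive.
    """
    depth = [0] * (region_end - region_start + 1)
    for a_chrom, start, end in alignments:
        if a_chrom != chrom:
            continue
        # Count only the part of the alignment that falls inside the region.
        first = max(start, region_start)
        last = min(end, region_end)
        for pos in range(first, last + 1):
            depth[pos - region_start] += 1
    lines = [f"fixedStep chrom={chrom} start={region_start} step=1"]
    lines.extend(str(d) for d in depth)
    return "\n".join(lines)


# Made-up alignments: two overlapping reads on chr5, one on chr1.
ALIGNMENTS = [
    ("chr5", 101, 104),
    ("chr5", 103, 106),
    ("chr1", 101, 104),
]

wig_text = coverage_wig(ALIGNMENTS, "chr5", 101, 106)
```

For real data the selection of alignments in the region would come from the indexed alignment store rather than a loop over every record.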

The second example (right) focuses on how organizing alignment data in HDF5 can change how differential alignment problems are approached. When data are organized according to a model that defines granularity and relationships, it becomes easier to compute all alignments between reads and multiple reference sources than to think about how to perform differential alignments and implement the process. In this case, a set of reads (obtained from cDNA) are aligned to primers, QC data (ribosomal RNA [rRNA] and mitochondrial DNA [mtDNA]), miRBase, refseq transcripts, the human genome, and a library of exon junctions. During alignment, up to three mismatches are tolerated between a read and its hit. Alignment data are stored in HDF5 and, because the data were not filtered, a greater variety of questions can be asked. Subtractive questions mimic the differential pipeline where alignments are used to filter reads from subsequent steps. At the same time, we can also ask "biological" questions about the number of reads that came from rRNA or mtDNA or from genes in the genome or exon junctions. And for these questions, we can examine the match quality between each read and its matching sequence in the reference data sources, without having to reprocess the same data multiple times.
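A small sketch shows why keeping every alignment is useful. Here the alignments are a plain list of (read, library, mismatches) records, which is an assumption made for the illustration and not the BioHDF data model. The same table answers a subtractive question, a "biological" question, and a match-quality question without realigning anything.

```python
from collections import defaultdict

# Hypothetical records: (read_id, library, number_of_mismatches).
ALIGNMENTS = [
    ("r1", "primers", 0),
    ("r2", "rRNA", 1),
    ("r2", "genome", 0),
    ("r3", "genome", 2),
    ("r3", "exon_junctions", 0),
    ("r4", "miRBase", 0),
    ("r5", "genome", 3),
]


def reads_by_library(alignments, max_mismatches=3):
    """Map each library to the set of reads that hit it within the
    allowed number of mismatches."""
    table = defaultdict(set)
    for read_id, library, mismatches in alignments:
        if mismatches <= max_mismatches:
            table[library].add(read_id)
    return table


def subtractive_counts(alignments, order, max_mismatches=3):
    """Mimic the filter pipeline: credit each read to the first
    library in `order` that it hits."""
    table = reads_by_library(alignments, max_mismatches)
    seen, counts = set(), {}
    for library in order:
        new_reads = table.get(library, set()) - seen
        counts[library] = len(new_reads)
        seen |= new_reads
    return counts


ORDER = ["primers", "rRNA", "miRBase", "genome", "exon_junctions"]

# Subtractive question: what would the filter pipeline have reported?
filtered = subtractive_counts(ALIGNMENTS, ORDER)

# Biological question: how many reads hit the genome at all?
genome_reads = len(reads_by_library(ALIGNMENTS)["genome"])

# Match-quality question: how many of those hits were perfect?
perfect_genome_reads = len(reads_by_library(ALIGNMENTS, max_mismatches=0)["genome"])
```

In this toy data the subtractive view credits only two reads to the genome, while the unfiltered table shows that three reads hit it, one of them perfectly.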

The above examples demonstrate the benefits of being able to organize data into structures that are amenable to computation. When data are properly structured, new approaches that expand the ways in which data are analyzed can be implemented. HDF5 and its library of software routines move the development process from activities associated with optimizing the low level infrastructures needed to support such systems to designing and testing different data models and exploiting their features.

The final post of this series will cover why we chose to work with HDF5 technology.


Monday, June 22, 2009

Sneak Peek: RNA-Seq - Global Profiling of Gene Activity and Alternative Splicing

Join us June 30 at 10:00 am PDT. Eric Olson, Geospiza's VP of Product Development, will present an interesting webinar on using RNA-Seq to measure gene expression and discover alternatively spliced messages using GeneSifter Analysis Edition.

Abstract

Next Generation Sequencing applications such as RNA-Seq, Tag Profiling and Small RNA Analysis allow whole genome analysis of coding as well as non-coding RNA at an unprecedented level. Current technologies allow for the generation of 200 million data points in a single instrument run. In addition to allowing for the complete characterization of all known RNAs in a sample, these applications are also ideal for the identification of novel RNAs and novel splicing events for known RNAs.

This presentation will provide an overview of the RNA-Seq data analysis process with emphasis on calculating gene and exon level expression values as well as identifying splice junctions from short read data. Comparisons of multiple groups to identify differential gene expression as well as differential splicing will also be discussed. Using data drawn from the GEO data repository and Short Read Archive (SRA), analysis examples will be presented for both Illumina’s GA and ABI’s SOLiD instruments.

To register visit the Geospiza webex event page.


Monday, June 15, 2009

Bloginar: Scalable Bioinformatics Infrastructures with BioHDF. Part III: The HDF5 Advantage

The Next Generation DNA Sequencing (NGS) bioinformatics bottleneck is related to the complexity of working with the data, analysis programs, and numerous output files that are produced as the data are converted from images to final results. Current systems lack well-organized data models and corresponding infrastructures for storing data and analysis information, resulting in significant levels of data processing, reprocessing, and data copying redundancies. Systems can be improved with data management technologies like HDF5.

In this third installment of the bloginar, results from our initial work with HDF5 are presented. Previous posts have provided an introduction to the series and background on NGS.

Working with NGS data

With the exception of de novo sequencing, in which novel genomes, or transcriptomes, are analyzed, many NGS applications can be thought of as quantitative assays where DNA sequences are highly informative data points. In these assays, large datasets of sequence reads are collected in a massively parallel format. Reads are aligned to reference data to obtain quantitative information by tabulating the frequency, positional information, and variation of the bases between sequences in the alignments. Data tables from samples that differ by experimental treatment, environment, or populations, are compared in different ways to make additional discoveries and draw conclusions. Whether the assay is to measure gene expression, study small RNAs, understand gene regulation, or quantify genetic variation, a similar process is followed:
  1. A large dataset of sequence reads is collected in a massively parallel format.
  2. The reads are aligned to reference data.
  3. Alignment data, stored in multiple kinds of output files, are parsed, reformatted, and computationally organized to create assay specific reports.
  4. The reports are reviewed and decisions are made for how to work with the next sample, experiment, or assay.
Current practices, where multiple programs create different kinds of information formatted in different ways, make this process difficult even for single samples. Achieving our main goal of comparing the analysis for multiple samples at a time is harder still. Presently, the four steps listed above must be repeated for each sample, and multiple reports that list expression values for single genes, or describe positional frequencies of read density, must be combined in some fashion to create new views to summarize or compute the differences and similarities between datasets. In the case of gene expression, for example, volcano plots can be used to compare the observed changes in gene expression with the likelihood that the observed changes are statistically significant. For a given gene, one might also want to drill into details that show how the reads align to the gene’s reference sequence. Further, the alignments from different samples for the gene need to be compared to see if there is evidence of alternative splicing or other interesting features.
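For readers who have not met a volcano plot, each gene becomes one point: the x value is the log2 fold change between two samples and the y value is the negative log10 of the p-value. The sketch below computes those coordinates. It assumes the expression values and p-values have already been produced by some upstream statistical test, and the pseudocount, function name, and numbers are illustrative choices only.

```python
import math


def volcano_point(mean_a, mean_b, p_value, pseudocount=1.0):
    """Return (log2 fold change, -log10 p) for one gene.

    The pseudocount avoids dividing by zero when a gene has no reads
    in one sample; its value is an assumption of this sketch.
    """
    log2_fc = math.log2((mean_b + pseudocount) / (mean_a + pseudocount))
    neg_log10_p = -math.log10(p_value)
    return log2_fc, neg_log10_p


# Hypothetical gene: 7 reads in sample A, 31 in sample B, p = 0.001.
x, y = volcano_point(7, 31, 0.001)
```

Genes far from zero on the x axis and high on the y axis are the ones that changed a lot and are unlikely to have done so by chance.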

Creating software that allows one to view NGS results in single and multi-sample contexts and drill into multiple levels of detail, that operates quickly and smoothly, and that makes it possible for IT administrators and PIs to predict development and research costs, requires us to store raw data and corresponding analysis results in structures that support the computational needs for the problems being addressed. To accomplish this goal, we can either develop a brand new infrastructure to support our technical requirements and build new software to support our applications, or we can build software on an existing infrastructure and benefit from the experience gained in solving similar problems in other scientific fields.

Geospiza is following the latter path and is using an open-source technology, HDF5 (hierarchical data format), to develop highly scalable bioinformatics applications. Moreover, as we have examined past practices and consider our present and future challenges, we have concluded that technologies like HDF5 have great benefits for the bioinformatics field. Toward this goal, Geospiza has initiated a research program collaborating with The HDF Group to develop extensions to HDF5 that meet specific requirements for genetic analysis.

HDF5 Advantages

Introducing a new technology into an infrastructure requires work. Existing tools need to be refactored and new development practices must be learned. The cost of switching over to new technology has direct development costs associated with refactoring tools and learning new environments as well as a time lag between learning the system and producing new features. Justifying such a commitment demands a return on investment. Hence, the new technology must offer several advantages over current practices, such as improved systems performance and (or) new capabilities that are not easily possible with existing approaches. HDF5 offers both.

With HDF5 technology, we will be able to create better performing NGS data storage and high performance data processing systems, and approach data analysis problems differently.

We'll consider system performance first. Current NGS systems store reads and associated data primarily in text-based flat files. Additionally, the vast majority of alignment programs also store data in text-based flat files, creating the myriad of challenges described earlier. When these data are, instead, stored in HDF5, a number of improvements can be achieved. Because the HDF5 software library and file format can store data as compressed “chunks,” we can reduce storage requirements and access subsets of data more efficiently. For example, read data can be stored in arrays, making it possible to quickly compute values like nucleotide frequency statistics for each base position in the reads from an entire multimillion read dataset.
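The per-position statistic is easy to picture once reads are thought of as rows of a two dimensional array: each column is one base position. The sketch below uses a plain list of equal-length strings in place of an HDF5 array, and the reads are made up.

```python
from collections import Counter


def position_base_counts(reads):
    """Return one Counter per base position, tallied across all reads.

    Assumes every read has the same length, as in a 2D array of bases.
    """
    # zip(*reads) walks the reads column by column.
    return [Counter(column) for column in zip(*reads)]


READS = ["ACGT", "ACGA", "TCGT", "ACTT"]

counts = position_base_counts(READS)
# Fraction of each base at position 0.
first_position = {base: n / len(READS) for base, n in counts[0].items()}
```

With the reads stored as an array, the same column-wise pass can be run over one chunk of rows at a time, so the whole dataset never has to be in memory at once.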

In the example presented, 9.3 million Illumina GA reads were stored in HDF5 as a compressed two dimensional array, resulting in a fourfold reduction in size when compared to the original fasta-formatted file. When the reads were aligned to a human genome reference, the flat file system grew from 609 MB to 1033 MB. The HDF5-based system increased in size by 230 MB to a total of 374 MB for all data and indices combined. In this simple example, the storage benefits of HDF5 are clear.
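The figures above can be cross-checked with a few lines of arithmetic. The only assumption is that the 609 MB figure is the size of the flat files holding the reads before alignment.

```python
# Sizes quoted in the text, in MB.
flat_reads_mb = 609     # flat files before alignment
flat_total_mb = 1033    # flat files after alignment
hdf5_growth_mb = 230    # added to HDF5 by the alignments
hdf5_total_mb = 374     # HDF5 after alignment, data and indices

# Size of the reads alone in HDF5.
hdf5_reads_mb = hdf5_total_mb - hdf5_growth_mb      # 144 MB

# Reduction for the reads alone: about 4.2, matching "fourfold".
read_compression = flat_reads_mb / hdf5_reads_mb

# Space added by the alignments in the flat file system.
flat_growth_mb = flat_total_mb - flat_reads_mb      # 424 MB

# Overall ratio after alignment: about 2.8.
total_ratio = flat_total_mb / hdf5_total_mb
```

So the alignments added 424 MB to the flat files but 230 MB to HDF5, and the complete HDF5 system is a little under three times smaller than the flat file system.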

We can also demonstrate the benefits of improving the efficiency of accessing data. A common bioinformatics scenario is to align a set of sequences (queries) to a set of reference sequences (subjects) and then examine how the query sequences compare to the subject sequence within a specific range. Software routines accomplish this operation by getting the name (or ID) of a subject sequence along with the beginning and ending positions of the desired range(s). This information is used to first search the set of alignments for the names (or IDs) of the query sequences that match, along with each query’s beginning and ending positions in the alignment. Next, the dataset of query sequences is searched to retrieve the matching data. When the data are stored in a non-indexed flat file, the file must be scanned from the beginning to find the matching sequences. Finding a single sequence takes, on average, half of the time needed to read the entire file, and finding every match requires reading the whole file. In contrast, indexed data can be accessed in a significantly reduced amount of time. The shorter time derives from two features: 1. A smaller amount of data needs to be read to conduct the search, and 2. Structured indices make searches more efficient.
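The difference between the two access patterns can be sketched with the standard library. The linear scan looks at every alignment, while the indexed lookup uses a sorted list of start positions and a binary search (Python's bisect module) to jump straight to the range. This is an illustration of the idea only and not how HDF5 indices are implemented. To keep it short, an alignment counts as "in range" when its start position falls inside the range, and the records are invented.

```python
import bisect

# Hypothetical alignment records: (subject_id, start_position, query_id).
ALIGNMENTS = [
    ("chr5", 500, "r3"),
    ("chr1", 100, "r1"),
    ("chr5", 120, "r2"),
    ("chr5", 900, "r4"),
    ("chr5", 505, "r5"),
]


def linear_scan(alignments, subject, lo, hi):
    """Non-indexed access: every record is examined."""
    return [q for s, start, q in alignments if s == subject and lo <= start <= hi]


def build_index(alignments):
    """Group alignments by subject and sort them by start position."""
    grouped = {}
    for subject, start, query in alignments:
        grouped.setdefault(subject, []).append((start, query))
    index = {}
    for subject, hits in grouped.items():
        hits.sort()
        index[subject] = ([s for s, _ in hits], [q for _, q in hits])
    return index


def indexed_lookup(index, subject, lo, hi):
    """Indexed access: binary search for the two ends of the range."""
    starts, queries = index.get(subject, ([], []))
    left = bisect.bisect_left(starts, lo)
    right = bisect.bisect_right(starts, hi)
    return queries[left:right]


index = build_index(ALIGNMENTS)
scan_hits = linear_scan(ALIGNMENTS, "chr5", 100, 510)
index_hits = indexed_lookup(index, "chr5", 100, 510)
```

Both calls return the same three reads, but the indexed version touches only the entries inside the range plus the few comparisons made by the binary search.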

In our example, the 9.3 million reads produced many millions of alignments when the data were compared to the human genome. We tested the performance for retrieving read alignment information from different kinds of file systems by accessing the alignments from successively smaller regions of chromosome 5. The entire chromosome contained roughly one million alignments. Retrieving the reads from the entire chromosome was slightly more efficient in HDF5 than retrieving the same data from the flat file system. However, as fewer reads were retrieved from smaller regions, the HDF5-based system demonstrated significantly better performance. For HDF5, the time to retrieve reads decreased as a function of the amount of data being retrieved down to 15 ms, the minimum overhead of running the program that accesses the HDF5 file. When compared to the minimum access time for the flat file (735 ms), a ~50-fold improvement is observed. As datasets continue to grow, the overhead for using the HDF5 system will remain at 15 ms, whereas the overhead for flat file systems will continue to increase.

The demonstrated performance advantages are not unique to HDF5. Similar results can be achieved by creating a data model to store the reads and alignments and implementing the model in a binary file format with indices to access the stored data in random ways. A significant advantage of HDF5 is that the software to implement the data models, build multiple kinds of indices, compress data in chunks, and read and write the data to and from the file has already been built, debugged, and supported by over 20 years of development. Hence, one of the most significant performance advantages associated with using the HDF platform is the savings in development time. To reproduce a similar, perhaps more specialized, system would require many months (even years) to develop, test, document, and refine the low-level software needed to make the system well-performing, highly scalable, and broadly usable. In our experience with HDF5, we’ve been able to learn the system, implement our data models, and develop the application code in a matter of weeks.

Consequently, we are spending more of our time solving the interesting challenges associated with analyzing millions of NGS reads from hundreds or thousands of samples to measure gene expression, identify alternatively spliced and small RNAs, study regulation, calculate sequence variation, and link summarized data to its underlying details, and we are spending a much smaller fraction of our time optimizing low-level infrastructures.

Additional examples of how HDF5 is changing our thinking will be presented next.
