Tuesday, April 13, 2010

Bloginar: Standardizing Bioinformatics with BioHDF (HDF5)

Yesterday we (The HDF Group and Geospiza) released the BioHDF prototype software.  To mark the occasion, and demonstrate some of BioHDF’s capabilities and advantages, I share the poster we presented at this year’s AGBT (Advances in Genome Biology and Technology) conference.

The following map guides the presentation. The poster has a title and four main sections, which cover background information, specific aspects of the general Next Generation Sequencing (NGS) workflow, and HDF5’s advantages for working with large amounts of NGS data.
 
Section 1.  The first section introduces HDF5 (Hierarchical Data Format) as a software platform for working with scientific data.  The introduction begins with the abstract and lists five specific challenges created by NGS: 1) high end computing infrastructures are needed to work with NGS data, 2) NGS data analysis involves complex multi-step processes that, 3) compare NGS data to multiple reference sequence databases, 4) the resulting datasets of alignments must be visualized in multiple ways, and 5) scientific knowledge is gained when many datasets are compared. 

Next, choices for managing NGS data are compared in a four category table.  These include text and binary formats. While text formats (delimited and XML) have been popular for bioinformatics, they do not scale well and binary formats are gaining in popularity. The current bioinformatics binary formats are listed (bottom left) along with a description of their limitations. 

The introduction closes with a description of HDF5 and its advantages for supporting NGS data management and analysis. Specifically, HDF5 is platform for managing scientific data. Such data are typically complex and consist of images, large multi-dimensional arrays, and meta data. HDF5 has been used for over 20 years in other data intensive fields; it is robust, portable, and tuned for high performance computing. Thus HDF5 is well suited for NGS. Indeed, groups from academic researchers to NGS instrument vendors, and software companies are recognizing the value of HDF5.
Section 2. This section illustrates how HDF5 facilitates primary data analysis. First we are reminded that NGS data are analyzed in three phases: primary analysis, secondary analysis and tertiary analysis. Primary analysis is the step that converts images to reads consisting of basecalls (or colors, or flowgrams), and quality values. In secondary analysis, reads are aligned to reference data (mapped) or amongst themselves (assembled). In many NGS assays, secondary analysis produces tables of alignments that must be compared to one and other, in tertiary analysis, to gain scientific insights. 

The remaining portion of section 2 shows how Illumina GA and SOLiD primary data (reads and quality values) can be stored in BioHDF and later reviewed using the BioHDF tools and scripts.  The resulting quality graphs are organized into three groups (left to right) to show base composition plots, quality value (QV) distribution graphs, and other summaries.

Base composition plots show the count of each base (or color) that occurs at a given position in the read. These plots are used to assess overall randomness of a library and observe systematic nucleotide incorporation errors or biases.

Quality value plots show the distribution of QVs at each base position within the ensemble of reads. As each NGS run produces many millions of reads, it is worthwhile summarizing QVs in multiple ways. The first plots, from the top, show the average QV per base with error bars indicating QVs that are within one standard deviation of the mean. Next, box and whisker plots show the overall quality distribution (median, lower and upper quartile, minimum and maximum values) at each position. These plots are followed by “error” plots which show the total count of QVs below certain thresholds (red, QV < 10; green QV < 20; blue, QV < 30). The final two sets of plots show the number of QVs at each position for all observed values and the number of bases having each quality value.

The final group of plots show overall dataset complexity, GC content (base space only), average QV/read, and %GC vs average QV (base space only).  Dataset complexity is computed by determining the number of times a given read exactly matches other reads in the dataset. In some experiments, too many identical reads indicates a problem like PCR bias. In other cases, like tag profiling, many identical reads are expected from highly expressed genes. Errors in the data can artificially increase complexity.
Section 3.  Primary data analysis gives us a picture of how well the samples were prepared or how well the instrument ran with some indication about sample quality. Secondary and tertiary analysis tell us about sample quality and more importantly, provides biological insights. The third section focuses on secondary and tertiary analysis and begins with a brief cartoon showing a high level data analysis workflow using BioHDF to store primary data, alignment results, and annotations. BioHDF tools are used to query these data and other software within GeneSifter is used to compare data between samples and display the data in interactive reports to examine the details from single or multiple samples.

The left side of this section illustrates what is possible with single samples. Beginning with a simple table that indicates how many reads align to each reference sequence, we can drill into multiple reports that provide increasing detail about the alignments. For example, the gene list report (second from top) uses gene model annotations to summarize the alignments for all genes identified in the dataset. Each gene is displayed as a thumbnail graphic that can be clicked to see greater detail, which is shown in the third plot. The Integrated Gene View not only shows the density of reads across the gene's genomic region, but also shows evidence of splice junctions, and identified single base differences (SNVs) and small insertions and deletions (indels). Navigation controls provide ways to zoom into and out of the current view of data, and move to new locations. Additionally, when possible, the read density plot is accompanied by an Entrez gene model and dbSNP data so that data can be observed in a context of known information. Tables that describe the observed variants follow. Clicking on a variant drills into the alignment viewer to show the reads encompassing the point of variation.

The right side illustrates multi-sample analysis in GeneSifter. In assays like RNA-Seq, alignment tables are converted to gene expression values that can be compared between samples. Volcano (top) and other plots are used visualize the differences between the datasets. Since each point in the volcano plot represents the difference in expression for a gene between two samples (or conditions), we can click on that point to view the expression details for that gene (middle) in the different samples. In the case of RNA-Seq, we can also obtain expression values for the individual exons with the gene, making it possible to observe differential exon levels in conjunction with overall gene expression levels (middle). Clicking the appropriate link in the exon expression bar graph, takes us to the alignment details for the samples being analyzed (bottom), in this example we have two cases and two control replicates. Like the single sample Integrated Gene Views, annotations are displayed with alignment data. When navigation buttons are clicked all of the displayed genes move together so that you can explore the gene's details and surrounding neighborhood for multiple samples in a comparative fashion.
Section 4.  The poster closes with details about BioHDF.  First, the data model is described. An advantage of the BioHDF model is that read data are organized non-redundantly. Other formats, like BAM, tend to store reads with alignments and if a read has multiple alignments in a genome, or is aligned to multiple reference sequences, it gets stored multiple times. This may seem trivial, but anything that can happen a million times, becomes noticeable. This fact is demonstrated in the in table listed in the second panel “High Performance Computing Advantages.”  Other HDF5 advantages are listed below the performance stats table.  Most notably is HDF5’s ability to easily support multiple indexing schemes like nested containment lists (NClists). NClists solve the problem of efficiently accessing reads from alignments that may be contained in other alignments, which I will save for a later post.

Finally, the poster is summarized with a number of take home points. These reiterate the fact that NGS is driving the need to use binary file formats to manage NGS and analysis results and that HDF5 provides an attractive solution because of its long history and development efforts that specifically target scientific programming requirements. In our hands, HDF5 has helped make GeneSifter a highly scalable and interactive web-application with less development effort than would have been needed to implement other technologies.  

If you are software developer and are interested in BioHDF please visit www.biohdf.org.  If you do not want to program and instead, want a way to easily analyze your NGS data to make new discoveries, please contact us

Labels: , , , , , , , ,

Wednesday, February 17, 2010

Standardizing the Next Generation of Bioinformatics Software Development With BioHDF (HDF5)

AGBT is next week, and well be there presenting a poster on our latest and greatest work with HDF5 and BioHDF tools. For those of you attending, check out the poster. For those unable to attend, check back later for the "Bloginar."

Abstract

Next Generation Sequencing technologies are powerful tools for rapidly sequencing genomes and studying functional genomics. However, the lack of scalable data analysis capabilities limits their potential. Future bioinformatics applications need to be developed on common standard infrastructures that can reduce overall data storage, increase data processing performance, integrate information from multiple sources and are self-describing. HDF technologies meet all of these requirements, have a long history, and are widely used in data-intensive science communities. They consist of general data file formats, software libraries and tools for manipulating the data. Compared to emerging standards such as the SAM/BAM formats, HDF5-based systems demonstrate improved I/O performance and methods to reduce data storage. HDF5 is also more extensible and can support multiple data indexes and store multiple data types. For these reasons, HDF5 and its BioHDF implementation are well qualified as standards for implementing data models in binary formats to support the next generation of bioinformatics applications.

In the poster we will present:
  1. An overview of NGS data analysis and workflows
  2. A prototype data model for working with NGS data
  3. Practical examples of data analysis and viewing information using the underlying framework
  4. Performance benchmarks comparing HDF5 to other file formats 

Labels: , ,

Monday, January 18, 2010

Systems Biology with HDF5

As many are aware, Geospiza and The HDF Group are collaborating to extend HDF (Hierarchical Data Format) technologies to support the data management needs of high performance computing applications in genomics. As we do this work, others are also adopting HDF5 as a data storage technology to work with different kinds of biological data.

The Association for Computing Machinery (ACM) recently published an article, "Unifying Biological Image Formats with HDF5," that argues for using HDF5 and HDF tools as a common framework for working with image files. This article is worth reading for several reasons.

First, it provides a nice introduction and background to HDF5, its origins, and movement towards becoming an ISO standard. HDF5's technical features are also included in this discussion.

Next, a brief history of the imaging community is covered to share how X-ray crystallographers, electron, and optical microscopists had all independently considered HDF5 as a framework for their next-generation image file formats. Through this discussion, the challenges that have been identified within the imaging community are listed.

Like genomics, the amounts of data being collected are ever increasing, current formats are inflexible and difficult to adapt to future modalities and dimensionality, and the nonarchival quality of data undermines long-term value. That is, current data typically lack sufficient metadata about their origins and experiments to be useful in the long-term.

The article goes on to make the point that current challenges with image data could be addressed if the community adopts an existing format that can support both generic and specialized data formats and meet a set of common requirements related to performance, interoperability, and archiving. Examples of how HDF5 meets these requirements are included. Briefly, HDF5's data caching can be used to overcome computation bottlenecks related to the fact that image sizes are exceeding RAM capacity. Interoperability issues can be addressed through HDF5's ability to store multiple metadata schemas in flexible ways. And, because HDF5 is self describing, data stored in HDF5 can be better preserved.

Finally, a barrier to moving to a new technology is supporting legacy applications that may be costly to replace. Thus, the article closes with a creative proposal for supporting legacy software applications and recommendations for future development. HDF5 files could support legacy software applications if they were able to present the data, stored within the HDF5 file, as the collection of directories and files required by the legacy application. This could be accomplished by developing an abstraction layer that could interact with FUSE (Filesystem in User Space) and essentially mount the HDF5 file as a virtual file system. Such a scenario is only possible because data are stored in HDF5 in a general way that can be further abstracted and presented in multiple specific ways.

While this article focused on issues related to image formats, there are many parallels that the genomics and Next Generation Sequencing communities should pay attention to, and if you are a bioinformatics software developer or running bioinformatics projects, you should put this paper on your must read list.

Labels: , , ,

Wednesday, January 13, 2010

2010 sequencing starts in style

Next Generation Sequencing (NGS) is a hot topic. As we kick off 2010, many themes continue. Data throughput is increasing, sequencing costs are decreasing, and NGS still requires extensive informatics support.

Throughput up, costs down

As sequencing throughput increases, the costs for collecting sequencing data decrease. Illumina is setting the pace for 2010 by announcing its latest sequencing instrument, the HiSeq2000. Illumina’s press release, news reports, and the blogosphere enthusiastically report on the instrument’s five fold increase in data throughput and ability to sequence an entire human genome in about one week for about $10,000.

What about the informatics?

This month’s reviews and editorials in Nature Reviews Genetics (NRG) and Nature Biotechnology (NBT), respectively, claim that the most significant NGS challenge continues to be dealing with the data. As pointed out in the NRG editorial, it is quite possible that the community will produce more sequence data this year than has been cumulatively produced in the past 10 years. The HiSeq, developments that will be announced by Applied Biosystems in February, and the coming single molecule sequencers support this. The editorial further makes the point that genome centers have the computing infrastructure to deal with the data, but the larger community of researchers, who could benefit from these technologies, do not. A similar observation was made at the end of the NBT review which pointed out that costs associated with downstream handling and processing of the data will possibly equal or exceed data collection costs.

The significance of the informatics challenge is that wide adoption of NGS technologies assumes that we have usable solutions for working with the data. These solutions go beyond simply getting a computer cluster with a sequencing instrument. To be useful, that cluster needs to reside in an adequately air conditioned room, be operated by people who know how to work with cluster hardware and software and can also optimize networks to manage the flow of data. Other individuals are needed who can write programs and scripts to process the data, work with multiple database technologies, and develop scalable user interfaces to visualize and navigate through the results and compare information between multiple samples and experiments.

The conversation about the informatics problem began with the introduction of NGS technologies. In 2008, Nature Methods (July) and NBT (October) published editorials speaking to the coming challenges. Later in 2009, Science published a new article about data intensive science. Previous FinchTalks have discussed the articles and their significance and the theme has remained the same; both the access to computing technologies and the skills needed to use the data are unavailable to the large numbers of researchers who need to use these technologies to remain competitive.

There is a solution

One solution to the informatics challenge created by NGS, and other data intensive technologies, is to make use of the immense Internet-based computing infrastructure that has been created by companies like Amazon, Google, Yahoo, and others. Also called Cloud Computing, Internet-based services remove many of the hardware and infrastructure barriers for utilizing high performance computing and storage technology. This message was delivered by the 2010 NBT kick off editorial and accompanying news feature, along with the next important message that software solutions also need to be adapted to cloud environments. Here the editorial, like many other descriptions of NGS informatics needs, falls short in that they only focus on alignment programs. Simply adapting alignment algorithms using technologies like Hadoop to employ Cloud-based high performance computing clusters is not a sufficient solution.

Aligning billions of reads to reference data quickly and accurately is clearly important. However it is just the first step of a complex analysis process. The subsequent steps of analyzing the billions of alignments to filter artifacts, identify true and new variation between sequences, discover alternative splice forms in transcripts, and compare data between samples are even more challenging.

Fortunately Geospiza understands the problem well. As our tag line, From Samples to ResultsTM, suggests, our lab and analysis systems focus on solving a complete set of problems that need to be addressed in order to do good science with NGS and other genetic analysis technologies.

Perhaps this is way we were the only software provider discussed in the NBT news feature, “Up in a cloud.”

Labels: , , ,

Thursday, December 31, 2009

2009 Review

The end of the year is a good time to reflect, review accomplishments, and think about the year to come. 2009 was a good year for Geospiza’s customers, with many exciting accomplishments for the company. Highlights are reviewed below.

Two products form a complete genetic analysis system


Geospiza’s two core products, GeneSifter Laboratory Edition (GSLE) and GeneSifter Analysis Edition (GSAE), help laboratories do their work and scientists analyze their data. GSLE is the LIMS (Laboratory Information Management System) that laboratories, from service labs to high-throughput data production centers, use to collect information about samples, track and manage laboratory procedures, organize and process data, and deliver data and results back to researchers. GSLE supports traditional DNA sequencing (Sanger), fragment analysis, genotyping, microarrays, Next Generation Sequencing (NGS) and other technologies.

In 2008, Geospiza released the third version of the platform (back then it was known as FinchLab). This version launched a new way of providing LIMS solutions. Traditional LIMS systems require extensive programming and customization to meet a laboratory’s specific requirements. They include a very general framework designed to support a wide range of activities. Their advantage is that they are highly customizable. However, this advantage comes at the expense of very high acquisition costs accompanied by lengthy requirements planning and programming before they become operational.

In contrast, GSLE contains default settings that support genetic analysis out-of-the-box, while allowing laboratories to customize operations without programmer support. Default settings in GSLE suppport DNA sequencing, microarray, and genotyping services. The GSLE abstraction layer supports extensive configuration to meet specific needs as they arise. Through this design, the costs of acquiring and operating a high-quality advanced LIMS system are significantly reduced.

Throughout 2009, 100’s of features were added to GSLE to increase support for instruments and data types, and improve how laboratory procedures (workflows) are created, managed, and shared. Enhancements were made to features like experiment ordering, organization, and billing. We also added new application programming interfaces (APIs) to enable integration with enterprise software. Specific highlights included:
  • Extending microarray support to include sample sheet generation and automate uploading files
  • Improving NGS file and data browsing to simplify the process of searching and viewing the 1000’s of files produced in Next Gen sequencing runs
  • Making NGS data downloads, of very large gigabase files, robust and easy
  • Adding worksets to group DNA and RNA samples in customized ways that facilitate laboratory processing
  • Creating APIs to utilize external password servers and programmatically receive data using GSLE form objects
  • Enhancing ways for groups to add HTML to pages to customize their look and feel
In addition to the above features, we’ve also completed development on methods to multiplex NGS samples and track MIDs (molecular identifiers and molecular barcodes), enter laboratory data like OD values and bead counts in batches, create orders with multiple plates, and access SQL queries through an API. Look for these great features and more in the early part of 2010.

GSAE

As noted, GSAE is Geospiza’s data analysis product. While GSLE is capable of running of running advanced data analysis pipelines, the primary focus of data analysis in GSLE is to provide quality control. Thus its data analyses and presentation focus on single samples. GSAE provides the infrastructure and tools to compare the results between samples. In the case of NGS, GSAE also provides more reports and data interactions. GSAE began as a web-based microarray data analysis platform making it well suited for NGS-based gene expression assays. Over 2009 many new features were added to extend its utility to NGS data analysis with a focus on whole transcriptome analysis. Highlights included:
  • Developing data analysis pipelines for RNA-Seq, Small RNA, ChIP-Seq, and other kinds of NGS assays
  • Adding tools to visualize and discover alternatively spliced transcripts in gene expression assays
  • Extending expression analysis tools to include interactive volcano plots, unbalanced two-way ANOVA computations
  • Increasing NGS transcriptome analysis capabilities to include variation detection and visualization
The above features fulfill the requirements needed to make a platform complete for both NGS and microarray-based gene expression analysis. And, the addition of variation detection and visualization lays the groundwork for GSAE to extend its market leadership to resequencing data analysis.

Geospiza Research

In 2009 Geospiza won two research awards in the form of Phase II STTR and Phase I SBIR grants. The STTR project is researching new ways to organize, compress, and access NGS data by adapting HDF technologies to bioinformatics. Through this work we are developing a robust data management infrastructure that supports our NGS sequencing analysis pipelines and interactive user interfaces. The second award targets NGS-based variation detection. This work began in the last quarter of the year, but is already delivering new ways to identify and visualize variants in RNA-Seq and whole transcriptome analysis.

To learn more about our progress in 2009, visit our news page. It includes our press releases and reports in the news, publications citing our software, and webinars where we have presented our latest and greatest.

As we close 2009, we especially want to thank our customers and collaborators for their support in making the year successful and we look forward to an exciting year ahead in 2010.

Labels: , , , ,

Tuesday, October 13, 2009

Super Computing 09 and BioHDF

Next month, Nov 16-20, we will be in Portland for Super Computing 09 - SC09. Join us at a Birds of a Feather (BoF) session to learn about developing bioinformatics applications with BioHDF. The session will be Wed. Nov 18 at 12:15 pm in room D139-140.

Developing Bioinformatics Applications with BioHDF

In this session we will present how HDF5 can be used to work with large volumes of DNA sequence data. We will cover the current state of bioinformatics tools that utilize HDF5 and proposed extensions to the HDF5 library to create BioHDF. The session will include a discussion of requirements that are being be considered to develop a data models for working with DNA sequence alignments to measure variation within sets of DNA sequences.

HDF5 is an open-source technology suite for managing diverse, complex, high-volume data in heterogeneous computing and storage environments. The BioHDF project is investigating the use of HDF5 for working with very large scientific datasets. HDF5 provides a hierarchical data model, binary file format, and collection of APIs supporting data access. BioHDF will extend HDF5 to support DNA sequencing requirements.

Initial prototyping of BioHDF has demonstrated clear benefits. Data can be compressed and indexed in BioHDF to reduce storage needs and enable very rapid (typically, few millisecond) random access into these sequence and alignment datasets, essentially independent of the overall HDF5 file size. Additional prototyping activities we have identified key architectural elements and tools that will form BioHDF.

The BoF session will include a presentation of the current state of BioHDF and proposed implementations to encourage discussion of future directions.

Labels: , , ,

Sunday, September 6, 2009

Open or Closed

A key aspect of Geospiza’s software development and design strategy is to incorporate open scientific technologies into the GeneSifter products to deliver user friendly access to best-of-breed tools used to manage and analyze genetic data from DNA sequencing, microarray, and other experiments.

Open scientific technologies include open-source and published academic algorithms, programs, databases, and core infrastructure software such as operating systems, web servers, and other components needed to build modern systems for data management. Unlike approaches that rely on proprietary software, Geospiza’s adoption of open platforms and participation in the open-source community benefits our customers in numerous ways.

Geospiza’s Open Source History

When Geospiza began in 1997, the company started building software systems to support DNA sequencing technologies and applications. Our first products focused on web-enabled data management for DNA sequencing-based genomics applications. Foundational infrastructure, such as the web-server, and application layer incorporated Apache and Perl. We were also leaders, in that our first systems operated on Linux, an open-source UNIX-based operating system. In those early days, however, we used proprietary databases such as Solid and Oracle because the open-source alternatives Postgres and MySQL were still lacking features needed to support robust data processing environments. As these products matured, we extended our application support to include Postgres to deliver cost-effective solutions for our customers. By adopting such open platforms we were able to deliver robust, high performing systems, rapidly at a reasonable cost.

In addition to using open-source technology as the foundation of our infrastructure, we also worked with open tools to deliver our scientific applications. Our first product, the Finch Blast-Server, utilized the public domain BLAST from NCBI. Where possible, we sought to include well-adopted tools for other applications such as base calling and sequence assembly and repeat masking, for which the source code was made available. We favored these kinds tools over developing our own proprietary tools, because it was clear that technologies emerging from communities like the genome centers would advance much quicker and be better tuned to the problems people were trying to address. Further, these tools, because of their wide adoption within their community and publication, received higher levels of scrutiny and validation than their proprietary counterparts.

Times Change

In the early days, many of the genome center tools were licensed by universities. As the bioinformatics field matured, open-source models for delivering bioinformatics software have become more popular. Led by NCBI and pioneered by organizations like TIGR (now JCVI) and the Sanger institute, the majority of useful bioinformatics programs are now being delivered open-source either under GPL, BSD like, or Perl Artistic style licenses (www.opensource.org). The authors of these programs have benefited from wider adoption of their programs and continued support from funding agencies like NIH. In some cases other groups are extending best-of-breed technologies into new applications.

A significant benefit of the increasing use of open-source licensing is that a large number of analytical tools are readily available for many kinds of applications. Today we have robust statistical platforms like R and BioConductor and several algorithms for aligning Next Gen Sequencing (NGS) data. Because these platforms and tools are truly open-source, bioinformatics groups can easily access these technologies to understand how they work and compare other approaches to their own. This creates a competitive environment for bioinformatics tool providers that drives improvements in algorithm performance and accuracy and the research community benefits greatly.

Design Choices

Early on, Geospiza recognized value incorporating tools from the academic research community into our user friendly software systems. Such tools were being developed in close collaboration with the data production centers that were trying to solve scientific problems associated with DNA sequence assembly and analysis. Companies developing proprietary tools designed to compete with these efforts were at a disadvantage, because they did not have real time access to conversations between biologists, lab specialists, and mathematicians needed to quickly develop the deep experience of working with biologically complex data. This disadvantage continues today. Further, the closed nature of proprietary software limits the ability to publish work and have critical peer review of the code needed to ensure scientific validation.

Our work could proceed more quickly because we did not have to invest in solving the research problems associated with developing algorithms. Moreover, we did not have to invest in proving the scientific credibility of an algorithm. Instead we could cite published references and keep our focus on solving problems associated delivering the user interfaces needed to work with the data. Our customers benefited by gaining easy access to best-of-breed tools and having the knowledge that they had a community to draw on to understand their scientific basis.

Geospiza continues its practice of adopting open best-of-breed technologies. Our NGS systems utilize multiple tools such as MAQ, BWA, Bowtie, MapReads and others. GeneSifter Analysis Edition utilizes routines from the R and BioConductor package to perform statistical computations to compare datasets from microarray and NGS experiments. In addition, we are addressing issues related to high performance computing through our collaboration with the HDF Group and the BioHDF project. In this case we are not only adopting open-source technology, but also working with leaders in the field to make open-source contributions of our own.

When you use Geospiza’s GeneSifter products you can be assured that you are using the same tools as the leaders in our fields to receive the benefits of reducing data analysis costs combined with the advantages of community support through forums and peer reviewed literature.

Labels: , , ,

Sunday, August 16, 2009

BioHDF on the Web

During the past spring and early part of summer, we presented our initial work using HDF5 technology to make next generation DNA sequencing data management scalable. The presentations are posted on web, along with other points of interest that are listed below.

Presentations:
Presentations by Mark Welsh, and myself can be found at SciVee.
Mark presents our poster at ISMB , and I present our work at the “Sequencing, Finishing and Analysis in the Future Meeting,” in Santa Fe.

We also presented at this and last year’s BOSC meetings that were held at ISMB. The abstracts and slides can be found at:

What others are thinking:
Real time commentary on the 2009 BOSC presentation can be found at friendfeed and another post. The Fisheye Perspective considers how HDF5 fits with semantic web tools.

HDF in Bioinformatics:
Check out Fast5 for using HDF5 to store sequences and base probablities.

BioHDF in the News:
Genome Web and Bioinform articles on HDF5 or referencing HDF5 include:

FinchTalk:
Links to FinchTalks about BioHDF from 2008 t0 present include:
Bloginar: Scalable Bioinformatics Infrastructures with BioHDF. HDF Software :
You can learn more about HDF5 and get the software and tools at:

Labels: , , ,

Sunday, July 12, 2009

Bloginar: Scalable Bioinformatics Infrastructures with BioHDF. Part V: Why HDF5?

Through the course of this BioHDF bloginar series, we have demonstrated how the HDF5 (Hierarchical Data Format) platform can successfully meet current and future data management challenges posed by Next Generation Sequencing (NGS) technologies. We now close the series by discussing the reasons why we chose HDF5.

For previous posts, see:

  1. The introduction
  2. Project background
  3. Challenges of working with NGS data
  4. HDF5 benefits for working with NGS data

Why HDF5?

As previously discussed, HDF technology is designed for working with large amounts of complex data that naturally organize into multidimensional arrays. These data are composed of discrete numeric values, strings of characters, images, documents, and other kinds of data that must be compared in different ways to extract scientific information and meaning. Software applications that work with such data must meet a variety of organization, computation, and performance requirements to support the communities of researchers where they are used.

When software developers build applications for the scientific community, they must decide between creating new file formats and software tools for working with the data, or adapting existing solutions that already meet a general set of requirements. The advantage of developing software that's specific to an application domain is that highly optimized systems can be created. However, this advantage can disappear when significant amounts of development time are needed to deal with the "low-level" functions of structuring files, indexing data, tracking bits and bytes, making the system portable across different computer architectures, and creating a basic set of tools to work with the data. Moreover, such a system would be unique, with only a small set of users and developers able to understand and share knowledge concerning its use.

The alternative to building a highly optimized domain-specific application system is to find and adapt existing technologies, with a preference for those that are widely used. Such systems benefit from the insights and perspective of many users and will often have features in place before one even realizes they are needed. If a technology has widespread adoption, there will likely be a support group and knowledge base to learn from. Finally, it is best to choose a solution that has been tested by time. Longevity is a good measure of the robustness of the various parts and tools in the system.

HDF: 20 Years in Physical Sciences

Our requirements for high-performance data management and computation system are these:

  1. Different kinds of data need to be stored and accessed.
  2. The system must be able to organize data in different ways.
  3. Data will be stored in different combinations.
  4. Visualization and computational tools will access data quickly and randomly.
  5. Data storage must be scalable, efficient, and portable across computer platforms.
  6. The data model must be self describing and accessible to software tools.
  7. Software used to work with the data must be robust, and widely used.

HDF5 is a natural fit. The file format and software libraries are used in some of the largest data management projects known to date. Because of its strengths, HDF5 is independently finding its way into other bioinformatics applications and is a good choice for developing software to support NGS.

HDF5 software provides a common infrastructure that allows different scientific communities to build specific tools and applications. Applications using HDF5 typically contain three parts: one or more HDF5 files to store data, a library of software routines to access the data, and the tools, applications and additional libraries to carry out functions that are specific to a particular domain. To implement an HDF5-based application, a data model be developed along with application specific tools such as user interfaces and unique visualizations. While implementation can be a lot of work in its own right, the tools to implement the model and provide scalable, high-performance programmatic access to the data have already been developed, debugged, and delivered through the HDF I/O (input/output) library.

In earlier posts, we presented examples where we needed to write software to parse fasta formatted sequence files and output files from alignment programs. These parsers then called routines in the HDF I/O library to add data to the HDF5 file. During the import phase, we could set different compression levels and define the chunk size to compress our data and optimize access times. In these cases, we developed a simple data model based on the alignment output from programs like BWA, Bowtie, and MapReads. Most importantly, we were able to work with NGS data from multiple platforms efficiently, with software that required weeks of development rather than the months and years that would be needed if the system was built from scratch.

While HDF5 technology is powerful "out-of-the-box," a number of features can still be added to make it better for bioinformatics applications. The BioHDF project is about making such domain-specific extensions. These are expected to include modifications to the general file format to better support variable data like DNA sequences. I/O library extensions will be created to help HDF5 "speak" bioinformatics by creating APIs (Application Programming Interfaces) that understand our data. Finally, sets of command line programs and other tools will be created to help bioinformatics groups get started quickly with using the technology.

To summarize, the HDF5 platform is well-suited for supporting NGS data management and analysis applications. Using this technology, groups will be able to make their data more portable for sharing because the data model and data storage are separated from the implementation of the model in the application system. HDF5's flexibility for the kinds of data it can store, makes it easier to integrate data from a wide variety of sources. Integrated compression utilities and data chunking make HDF5-based systems as scalable as they can be. Finally, because the HDF5 I/O library is extensive and robust, and the HDF5 tool kit includes basic command-line and GUI tools, a platform is provided that allows for rapid prototyping, and reduced development time, thus making it easier to create new approaches for NGS data management and analysis.

For more information, or if you are interested in collaborating on the BioHDF project, please feel free to contact me (todd at geospiza.com).

Labels: , , ,

Monday, July 6, 2009

Bloginar: Scalable Bioinformatics Infrastructures with BioHDF. Part IV: HDF5 Benefits

Now that we're back from Alaska and done with the 4th of July fireworks, it's time to present the next installment of our series on BioHDF.

HDF highlights
HDF technology is designed for working with large amounts of scientific data and is well suited for Next Generation Sequencing (NGS). Scientific data are characterized by very large datasets that contain discrete numeric values, images, and other data, collected over time from different samples and locations. These data naturally organize into multidimensional arrays. To obtain scientific information and knowledge, we combine these complex datasets in different ways and (or) compare them to other data using multiple computational tools. One difficulty that plagues this work is that the software applications and systems for organizing the data, comparing datasets, and visualizing the results are complicated, resource intensive, and challenging to develop. Many of the development and implementation challenges can be overcome using the HDF5 file format and software library

Previous posts have covered:
1. An introduction
2. Background of the project
3. Complexities of NGS data analysis and performance advantages offered by the HDF platform.

HDF5 changes how we approach NGS data analysis.

As previously discussed, the NGS data analysis workflow can be broken into three phases. In the first phase (primary analysis) images are converted into short strings of bases. Next, the bases, represented individually or encoded as di-bases (SOLiD), are aligned to reference sequences (secondary analysis) to create derivative data types such as contigs or annotated tables of alignments, that are further analyzed (tertiary analysis) in comparative ways. Quantitative analysis applications, like gene expression, compare the results of secondary analyses between individual samples to measure gene expression and identify mRNA isoforms, or make other observations based on a sample’s origin or treatment.

The alignment phase of the data analysis workflow creates the core information. During this phase, reads are aligned to multiple kinds of reference data to understand sample and data quality, and obtain biological information. The general approach is to align reads to sets of sequence libraries (reference data). Each library contains a set of sequences that are annotated and organized to provide specific information.

Quality control measures can be added at this point. One way to measure data quality is to ask how many reads were obtained from constructs without inserts. Aligning the read data to a set of primers (individually and joined in different ways) that were used in the experiment, allows us to measure the number reads that match and how well they match. A higher quality dataset will have a larger proportion of sequences matching our sample and a smaller proportion of sequences that only match the primers. Similarly, different biological questions can be asked using libraries constructed of sequences that have biological meaning.

Aligning reads to sequence libraries is the easy part. The challenge is analyzing the alignments. Because the read datasets in NGS assays are large, organizing alignment data into forms we can query is hard. The problem is simplified by setting up multistage alignment processes as a set of filters. That is, reads that match one library are excluded from the next alignment. Differential questions are then asked by counting the numbers of reads that match each library. With this approach, each set of alignments is independent of the other alignments and a program only needs to analyze one set of alignments at time. Filter-based alignment is also used to distinguish reads with perfect matches from those with one or more mismatches.

Still, filter-based alignment approaches have several problems. When new sequence libraries are introduced, the entire multistage alignment process must be repeated to update results. Next, information about reads that have multiple matches in different libraries, or perfect matches and non-perfect matches within a library are lost. Finally, because alignment formats between programs differ and good methods for organizing alignment data do not exist, it is hard to compare alignments between multiple samples. This last issue also creates challenges for linking alignments to the original sequence data and formatting information for other tools.

As previously noted, solving the above problems requires that alignment data be organized in ways that facilitate computation. HDF5 provides the foundation to organize and store both read and alignment data to enable different kinds of data comparisons. This ability is demonstrated by the following two examples.

In the first example (left), reads from different sequencing platforms (SOLiD, Illumina, 454) were stored in HDF5. Illumina RNA-Seq reads from three different samples were aligned to the human genome and annotations from a UCSC GFF (genome file format) file were applied to define gene boundaries. The example shows the alignment data organized into three HDF5 files, one per sample, but in reality the data could have been stored in a single file or files organized in other ways. One of HDF's strengths is that the HDF5 I/O library can query multiple files as if they were a single file, providing the ability to create the high-level data organizations that are the most appropriate for a particular application or use case. With reads and alignments structured in these files, it is a simple matter to integrate data to view base (color) compositions for reads from different sequencing platforms, compare alternative splicing between samples, and select a subset of alignments from a specific genomic region, or gene, in a "wig" format for viewing in a tool like the UCSC genome browser.

The second example (right) focuses on how organizing alignment data in HDF5 can change how differential alignment problems are approached. When data are organized according to a model that defines granularity and relationships, it becomes easier to compute all alignments between reads and multiple reference sources, than think about how to perform differential alignments and implement the process. In this case, a set of reads (obtained from cDNA) are aligned to primers, QC data (ribosomal RNA [rRNA] and mitochondrial DNA [mtDNA]), miRBase, refseq transcripts, the human genome, and a library of exon junctions. During alignment up to three mismatches are tolerated between a read and its hit. Alignment data are stored in HDF5 and, because the data were not filtered, a greater variety of questions can be asked. Subtractive questions mimic the differential pipeline where alignments are used to filter reads from subsequent steps. At the same time, we can also ask "biological" questions about the number of reads that came from rRNA or mtDNA or from genes in the genome or exon junctions. And for these questions, we can examine the match quality between each read and its matching sequence in the reference data sources, without having to reprocess the same data multiple times.

The above examples demonstrate the benefits of being able to organize data into structures that are amenable to computation. When data are properly structured, new approaches that expand the ways in which data are analyzed can be implemented. HDF5 and its library of software routines move the development process from activities associated with optimizing the low level infrastructures needed to support such systems to designing and testing different data models and exploiting their features.

The final post of this series will cover why we chose to work with HDF5 technology.

Labels: , , , ,

Monday, June 15, 2009

Bloginar: Scalable Bioinformatics Infrastructures with BioHDF. Part III: The HDF5 Advantage

The Next Generation DNA Sequencing (NGS) bioinformatics bottleneck is related to the complexity of working with the data, analysis programs, and numerous output files that are produced as the data are converted from images to final results. Current systems lack well-organized data models and corresponding infrastructures for storing data and analysis information resulting in significant levels of data processing, reprocessing, and data copying redundancies. Systems can be improved with data management technologies like HDF5.

In this third installment of the bloginar, results from our initial work with HDF5 are presented. Previous posts have provided an introduction to the series and background on NGS.

Working with NGS data

With the exception of de novo sequencing, in which novel genomes, or transcriptomes, are analyzed, many NGS applications can be thought of as quantitative assays where DNA sequences are highly informative data points. In these assays, large datasets of sequence reads are collected in a massively parallel format. Reads are aligned to reference data to obtain quantitative information by tabulating the frequency, positional information, and variation of the bases between sequences in the alignments. Data tables from samples that differ by experimental treatment, environment, or populations, are compared in different ways to make additional discoveries and draw conclusions. Whether the assay is to measure gene expression, study small RNAs, understand gene regulation, or quantify genetic variation, a similar process is followed:
  1. A large dataset of sequence reads are collected in a massively parallel format.
  2. The reads are aligned to reference data.
  3. Alignment data, stored in multiple kinds of output files, are parsed, reformatted, and computationally organized to create assay specific reports.
  4. The reports are reviewed and decisions are made for how to work with the next sample, experiment, or assay.
Current practices where multiple programs create different kinds of information, formatted in different ways make this process difficult even for single samples. Achieving our main goal of comparing the analysis for multiple samples at a time is harder still. Presently, the four steps listed above must be repeated for each sample and multiple reports, that list expression values for single genes, or describe positional frequencies of read density, must be combined in some fashion to create new views to summarize or compute the differences and similarities between datasets. In the case of gene expression, for example, volcano plots can be used to compare the observed changes in gene expression with the likelihood that the observed changes are statistically significant. For a given gene, one might also want to drill into details that show how the reads align to the gene’s reference sequence. Further, the alignments from different samples for the gene need to be compared to see if there is evidence of alternative splicing or other interesting features.

Creating software that allows one to view NGS results in single and multi sample contexts, drill into multiple levels of detail, operates quickly and smoothly, and makes it possible for IT administrators and PIs to predict development and research costs, requires us to store raw data and corresponding analysis results in structures that support the computational needs for the problems being addressed. To accomplish this goal, we can either develop a brand new infrastructure to support our technical requirements and build new software to support our applications, or we can build software on an existing infrastructure and benefit from the experience gained in solving similar problems in other scientific fields.

Geospiza is following the latter path and is using an open-source technology, HDF5 (hierarchical data format), to develop highly scalable bioinformatics applications. Moreover, as we have examined past practices and consider our present and future challenges, we have concluded that technologies like HDF5 have great benefits for the bioinformatics field. Toward this goal, Geospiza has initiated a research program collaborating with The HDF Group to develop extensions to HDF5 that meet specific requirements for genetic analysis.

HDF5 Advantages

Introducing a new technology into an infrastructure requires work. Existing tools need to be refactored and new development practices must be learned. The cost of switching over to new technology has direct development costs associated with refactoring tools and learning new environments as well as a time lag between learning the system and producing new features. Justifying such a commitment demands a return on investment. Hence, the new technology must offer several advantages over current practices, such as improved systems performance and (or) new capabilities that are not easily possible with existing approaches. HDF5 offers both.

With HDF5 technology, we will be able to create better performing NGS data storage and high performance data processing systems, and approach data analysis problems differently.

We'll consider system performance first. Current NGS systems store reads and associated data primarily in text-based flat files. Additionally, the vast majority of alignment programs also store data in text-based flat files, creating the myriad of challenges described earlier. When these data are, instead, stored in HDF5, a number of improvements can be acheived. Because the HDF5 software library and file format can store data as compressed “chunks,” we can reduce storage requirements and access subsets of data more efficiently. For example, read data can be stored in arrays making it possible to quickly compute values like nucleotide frequency statistics for each base position in the reads from an entire multimillion read dataset.

In the example presented, 9.3 million Illumina GA reads were stored in HDF5 as a compressed two dimensional array resulting in a four fold reduction in size when compared to the original fasta formatted file. When the reads were aligned to a human genome reference, the flat file system grew from 609 MB to 1033 MB. The HDF5-based system increased in size by 230 MB to a total of 374 MB for all data and indices combined. In this simple example, the storage benefits of HDF5 are clear.

We can also demonstrate the benefits of improving the efficiency of accessing data. A common bioinformatics scenario is to align a set of sequences (queries) to a set of reference sequences (subjects) and then examine how the query sequences compare to the subject sequence within a specific range. Software routines accomplish this operation by getting the name (or ID) of a subject sequence along with the beginning and ending positions of the desired range(s). This information is used to first search the set of alignments for the names (or IDs) of the query sequences that match and query’s beginning and ending positions that match in the alignment. Next, the dataset of query sequences is searched to retrieve the matching data. When the data are stored in a non-indexed flat file, the entire file must be read to find the matching sequences. This takes, on average, half of the time needed to read the entire file. In contrast, indexed data can be accessed in a significantly reduced amount of time. The shorter time derives from two features: 1. A smaller amount of data needs to be read to conduct the search, and 2. Structured indices make searches more efficient.

In our example, the 9.3 million reads produced many millions of alignments when the data were compared to the human genome. We tested the performance for retrieving read alignment information from different kinds of file systems by accessing the alignments from successively smaller regions of chromosome 5. The entire chromosome contained roughly one million alignments. Retrieving the reads from the entire chromosome was slightly more efficient in HDF5 than retrieving the same data from the flat file system. However, as fewer reads were retrieved from smaller regions, the HDF5-based system demonstrated significantly better performance. For HDF5, the time to retrieve reads decreased as a function of the amount of data being retrieved down to 15 ms, the minimum overhead of running the program that accesses the HDF5 file. When compared to the minimum access time for the flat file (735 ms), a ~50 fold improvement is observed. As datasets continue to grow, the overhead for using the HDF5 system will remain at 15 ms, whereas the overhead for flat file systems will continue to increase.

The demonstrated performance advantages are not unique to HDF5. Similar results can be achieved by creating a data model to store the reads and alignments and implementing the model in a binary file format with indices to access the stored data in random ways. A significant advantage of HDF5 is that the software to implement the data models, build multiple kinds of indices, compress data in chunks, and read and write the data to and from the file has already been built, debugged, and supported by over 20 years of development. Hence, one of the most significant performance advantages associated with using the HDF platform is the savings in development time. To reproduce a similar, perhaps more specialized, system would require many months (even years) to develop, test, document, and refine the low-level software needed to make the system well-performing, highly scalable, and broadly usable. In our experience with HDF5, we’ve been able to learn the system, implement our data models, and develop the application code in a matter of weeks.

Consequently, we are spending more of our time solving the interesting challenges associated with analyzing millions of NGS reads from 100's or 1000's of samples to measure gene expression, identify alternatively spliced and small RNAs, study regulation, calculate sequence variation, and link summarized data to its underlying details, and we are spending a much smaller fraction of our time optimizing low-level infrastructures.

Additional examples of how HDF5 is changing our thinking will be presented next.

Labels: , ,

Thursday, June 11, 2009

Bloginar: Scalable Bioinformatics Infrastructures with BioHDF. Part II: Background

Next Generation DNA Sequencing (NGS) technologies will continue to produce ever increasing amounts of complex data. In the old world of electrophoresis-based sequencing (a.k.a. Sanger) we could effectively manage projects and data growth by relying largely on improvements in computer hardware. NGS breaks this paradigm, we now must think more deeply about software and hardware systems as a whole solution. Bringing technologies, like HDF5, into the bioinformatics world provides a path to such solutions. In this installment of the bloginar (started with the previous post), the power and challenges associated with NGS data analysis are discussed.

The Power of Next Generation DNA Sequencing

Over the past couple of years, 100’s of journal articles demonstrating the power of NGS have been published. These articles have been accompanied by numerous press reports communicating the community’s excitement and selected quotes from those articles convey the transformational nature of these technologies. Experiments that, only a short time ago, required a genome center-styled approach can now be done by a single lab.

Transcriptome profiling and digital gene expression are tewo examples that illustrate the power of NGS. For a few thousand dollars, a single run of a SOLiD or Illumina sequencer can produce more ESTs (Expressed Sequence Tags) than exist in all of dbEST. Even more striking is that the data in dbEST were collected over many years and 100's million dollars were invested in getting the sequences. Also, it is important to understand that the data in dbEST represent only a relatively small number of molecules from a broad range of organisms and tissues. While dbEST has been useful for providing a glimpse into the complex nature of gene expression and, in the case of human, mouse and a few other organisms, mapping a reasonable number of genes, the resource is unable to provide the deeper details and insights into alternative splicing, antisense transcription, small RNA expression, and allelic variation that can only be achieved when a cell’s complement of RNA is sampled deeply. As recent literature demonstrates, each time an NGS experiment is conducted, we seem to get a new view about the complexity of gene expression and how genomes are transcribed. A similar story repeats when NGS is compared other gene expression measurement systems like microarrays.

Of course with the excitement about NGS, comes the frightening reports about how to handle the data. Just as one can go into the news and scientific literature to get great quotes about NGS power, one can get similar kinds of quotes describing the challenges associated with the increase in data production. Metaphors of fire hoses spraying data often accompany graphs showing exponential growth in data production and accumulation. When these growth curves are compared to Moore’s law, we learn that the data are accumulating at faster rates than improvements in hardware performance. So, those who plan to wait for data management and analysis problems to be solved by better computers, will have to wait a long time.

Analyzing data is harder than it looks

A significant portion of the NGS data problem discussion focuses on how to store the large amounts of data. While a major concern, there are even greater challenges with analyzing the data, this is often referred to as the “bioinformatics bottleneck.” To understand the reason for the bottleneck, we need to understand data NGS analysis. It is common to divide the high-level NGS analysis workflow that converts images into bases, and bases into results into three phases. In the first phase (primary analysis), millions of individual images are analyzed to create files containing millions of sequence reads. The resulting reads are then converted to biologically meaningful information, either by aligning the reads to reference data or assembling the reads into contigs (de novo sequencing). This second phase (secondary analysis) is application specific; each kind of sequencing experiment or assay requires a specific workflow in which reads are compared to different reference sources in multi-step iterative processes. The third phase (tertiary) consists of analyzing multiple datasets of information derived from the second phase in order to annotate genomes, measure gene expression, observe linked variants, or uncover epigenetic patterns.

The bioinformatics bottleneck stems primarily from the complexity of secondary analysis. Secondary analyses use several different alignment programs to compare reads to reference data. Each alignment program produces multiple output files, each containing a subset of information that must be interpreted to understand what the data mean. The output files are large in their own right. A few million reads, for example, can produce many tables of alignment data consisting of many millions of rows. It is not possible to get meaning out of such files without additional software to “read” the data. And, no two alignment programs produce data in a common format. To summarize, secondary analysis utilizes multiple programs, each program produces a varied number of files with data in different formats, and the files are too large and complex to be understood by reading the data by eye. More software is needed.

From a computational and software engineering stand point, there is room for improvement. When the common secondary analysis processes are analyzed further, we see that the many large hard to read files are the tip of the iceberg. The lack of well-organized data models and infrastructure for storing structured information results in significant levels of data processing and data copying redundancy. This is due to the need to represent common information in different contexts. The problem is, that with data throughput scaling faster than Moore’s law, maintaining systems with this kind of complexity and redundancy becomes unpredictably costly in terms of system performance, storage needs, and software development time. Ultimately, it limits what we can do with the data and the questions that can be asked.

The next post discusses how this problem can be addressed.

Labels: , ,

Tuesday, April 7, 2009

Have we been deluged?

In early March, Science published a perspective “Beyond the Data Deluge [1].” Last October, Nature Biotechnology (NBT) published an editorial “Prepare for the Deluge [2].” Did we miss the deluge?

Perhaps we are being deluged now


The articles' general themes were centered around the issue that scientists are having to work with more data than ever before. The NBT editorial focused on Next Generation DNA Sequencing (NGS) and covered many of the issues that we’ve also identified as important [3]. Selected quotes illustrate the challenges:
“Beyond the data management issues, significant computational infrastructure is also required.”

“In this respect, the open SOLiD software development community (http://solidsoftwaretools.com/gf/), which in July recruited two new software companies, is a step in the right direction.”

“For users, it would also be beneficial if data formats and the statistics for comparing performance of the different instruments could be standardized."

And one of my favorites: “At the moment, the cost of bioinformatics support threatens to keep the technology in the hands of the few. And this powerful technology clearly needs to be in the hands of many. If the data problem is not addressed, ABI’s SOLiD, 454’s GS FLX, Illumina’s GAII or any of the other deep sequencing platforms will be destined to sit in their air-conditioned rooms like a Stradivarius without a bow.”

We [biologists] are not alone.

Data-intensive science

The Science article presented a bigger picture. Written by luminaries (Gordon Bell, Tony Hey, and Alex Szalay) in the fields of computer science and astronomy, Bell, Hey, and Szalay point out that other fields like astronomy and particle physics have experiments generating petabytes of data per year. In bioinformatics, they noted how the extreme heterogeneity of data challenges scientists and how, in molecular biology, traditional hypothesis-led approaches are being replaced by a data-intensive inductive approaches. That is, collect a ton of data, observe patterns, and make discoveries. They go on to discuss how data “born digital” proliferate in files, spreadsheets, or databases and get scattered on hard drives and in Web sites, blogs, and wiki’s and how managing and curating these digital data is becoming increasingly burdensome. Ever have to move stuff around to make space, or spend too much time locating a file? What about your backups?

The article continues by discussing how data production is out pacing Moore’s Law (states that computer processors double in speed every 18 months or so), and how new kinds of clustered computing systems are needed to process the increasing flood of data. The late Jim Gray was a visionary in recognizing the need for new computer tools and defined a fourth research paradigm as data-intensive science.

Through the discussion a number of challenges to data-intensive science were presented. These are related to moving and accessing data, subtleties of organizing and defining data (also called schemas, ontologies, and semantics), and non-standard uses of technology. An example was cited where large database systems hold only pointers to files rather than the data within the files making direct analysis of the data impractical. A case study of astronomy community was provided to share how one discipline has successfully transitioned to data-intensive science.

In closing the article, the authors indicate that while, data-intensive science requires specialized skill and analysis tools, the research community should be optimistic because, today, more options exist for accessing storage and computing resources on demand through "cloud services" that are being built by the IT industry. Cloud services provide high-bandwidth cost-effective services, to provide computing capabilities that are beyond the financial scope of universities and national laboratories. According to the authors, this model offers an attractive solution to the scale problem, but requires radical thinking to make optimal use of these services. Bell and colleagues close with the following quote:
“In the future, the rapidity with which any given discipline advances is likely to depend on how well the community acquires the necessary expertise in database, workflow management, visualization, and cloud computing technologies.”
OK, we're radical

Geospiza been helping labs and research groups meet current and future bioinformatics challenges through a number of creative approaches. For example, Nature Biotechnology lauded the SOLiD community program; Geospiza was one of those first two companies. Geospiza is also actively participating in the community through consortium actives like SEQC to develop standardized approaches to working with NGS systems and data. We have useful ways of analyzing and visualizing information in NGS data and our BioHDF project is tackling the deeper issues related to working with large amounts of extremely heterogeneous data and high performance computing so we can go beyond simply referencing files from databases and analyze these data more extensively than is currently possible. Finally, our GeneSifter products have used cloud computing from the very beginning and we have been successfully extending the cloud computing model to NGS.

References and additional reading

[1] 2008. Prepare for the deluge. Nat Biotechnol 26, 1099.
[2] Bell G., Hey T., Szalay A., 2009. Computer science. Beyond the data deluge. Science 323, 1297-1298.

[3] FinchTalks:
Journal Club: Focus on Next Gen Sequencing
The IT problem
The bioinformatics problem

BioHDF FinchTalks:
BOSC Abstract and Presentation
Introducing BioHDF
The case for HDF
Genotyping with HDF

Labels: , ,

Tuesday, March 17, 2009

Introducing BioHDF

Today we released news of our funding for the BioHDF project. This project, in collaboration with The HDF Group, is focused on developing scalable bioinformatics technologies to support the many current and emerging Next Generation Sequencing (NGS) applications such as Transcription Profiling, Digital Gene Expression, Small RNA Analysis, Copy Number Variation and Resequencing.

You might ask, isn’t Geospiza already doing all of those things listed above? The answer is yes, but there is more to NGS work than meets the eye.

When we work with NGS data today, we do so with software and systems that are inefficient in three important areas: Space, Time, and Bandwidth. Although people are managing to get by, as throughput continues to increase, space, time and bandwidth issues are going become more acute and will lead to disproportionally higher software development and data management costs. BioHDF will help solve the space, time, and bandwidth issues at the infrastructure level.

Space issues are related to storing the data. As we know, NGS systems create a lot of data. Less understood is that when the data are processed the outputs of the alignment programs produce an amount of data equal or larger than the original input files. Thus, one challenge groups face, is that despite their plans for a certain amount of storage, after they've run some programs, they get surprised by finding they’ve run out of space, sometimes even in the middle of a one day program run :-(

Current practices for computing alignments also create time inefficiencies in detailing with NGS data. Today’s algorithms are improving, but still largely require that data be held in computer memory (RAM) during processing. When problems get too large, algorithms use disk space in a random way. Also known as swapping, this process kills performance and jobs must be terminated. In many cases, the problem is handled by breaking the problem down into smaller units for computation, writing scripts to track the pieces and steps, and putting the output files together at the end.

Additionally, many programs require that data be in a certain format prior to processing. As files get ever larger, the time needed to reformat data adds significantly to the computational burden.

Bandwidth is a measure of the data transfer rate. Small files transfer quickly, but when they get larger, there is a noticeable lag between the start and finish. With NGS, the data transfer rate becomes a significant factor in systems planning. Some groups have gone to great lengths to improve their networks when setting up their labs for NGS. In our cloud computing services, we use specialized software to improve data transfer rates and ensure transfers are complete because tools like ftp are not robust enough to reliably handle NGS data volumes.

BioHDF will help us work with space, time, bandwidth constraints.

Meeting the above challenge, requires that we have well performing software frameworks and underlying data management tools to store and organize data in better ways than complex mixtures of flat files and relational databases. Geospiza and The HDF Group are collaborating to develop open-source, portable, scalable, bioinformatics technologies based on HDF5 (Hierarchical Data Format). We call these extensible domain-specific data technologies “BioHDF.” BioHDF will implement a data model that supports primary DNA sequence information (reads, quality values, meta data) and the results from sequence alignment and variation detection algorithms. BioHDF will extend HDF5 data structures and library routines with new features (indexes, additional compression, graph layouts) to support the high performance data storage and computation requirements of Next Gen Sequencing.

Labels: ,

Wednesday, July 30, 2008

BioHDF at BOSC

The scale of Next Gen sequencing is only going to increase, hence we need to fundamentally change the way we work with Next Gen data. New software systems with scalable data models, APIs, software tools, and viewers are needed to support the very large datasets used by the applications that analyze Next Gen DNA sequence data.

That was the theme of a talk I presented at the BOSC (Bioinformatics Open Source Conference) meeting that preceded ISMB (Intelligent Systems for Molecular Biology) in Toronto, Canada, July 19th. You can get the slides from the BOSC site. At the same time, we posted a blog on Genographia, a next-generation genomics community web site devoted to Next Gen sequencing discussions and idea sharing. The key points are summarized below.

Motivation

The BioHDF project is motivated by the fact that the next and future generations of data collection technologies, like DNA sequencing, are creating ever increasing amounts of data. Getting meaningful information from these data require that multiple programs be used in complex processes. Current practices for working with these data create many kinds of challenges, ranging from managing large numbers of files and formats to having the computation power and bandwidth to make calculations and move data around. These practices have a high cost in terms of storage, CPU, and bandwidth efficiency. In addition, they require significant human effort in understanding idiosyncratic program behavior and output formats

Is there a better way?

Many would agree that if we could reduce the number of file formats, avoid data duplication, and improve how we access and process data, we could develop better performing and more interoperable applications. Doing so requires improved ways of storing data and making it accessible to programs. For a number of years we have thought about these goals might be accomplished and looked to other data-intensive fields to see how others have solved these problems. Our search ended when we found HDF (hierarchical data format), a standard file format and library used in the physical and earth sciences.

BioHDF

HDF5 can be used in many kinds of bioinformatics applications. For specialized areas, like DNA sequencing, domain specific extensions will be needed. BioHDF is about developing those extensions, through community support, to create a file format and accompanying library of software functions that are needed to build the scalable software applications of the future. More will follow, if you are interested contact me: todd at geospiza.com.

Labels: , , , , ,

Sunday, March 2, 2008

Genotyping with HDF

To continue our progress describing HDF and its value in bioinformatics, I present the work Geospiza and THG performed in developing a prototype application for genotyping. In this project we implemented a data model, based on polyPhred, in HDF5 to demonstrate HDF5's data organization capabilities. We further demonstrated HDF5's strengths for compressing and accessing data by adding HapMap genotype data sets and data from chromosome level linkage disequilibrium calculations.

Organizing Data - A resequencing project begins with a region of a genome, a gene or set of genes that will be studied. A researcher will have sets of samples from patient populations from which they will isolate DNA. PCR is used to amplify specific regions of DNA so that both chromosomal copies can be sequenced. The read data, obtained from chromatograms, are aligned to other reads and also to a reference sequence. Quality and trace information are used to predict whether the differences observed between reads and reference data are meaningful. The general data organization can be broken into a main part called the Gene Model and within it, two sub organizations: the reference and experimental data. The reference consists of the known state of information. Resequencing, by definition, focuses on comparing new data with a reference model. The reference model organizes all of the reference data including the reference sequence and annotations.

The sub organizations of data can be stored in HDF5 using its standard features. The two primary objects in HDF5 are "groups" and "datasets." The HDF5 group object is akin to a UNIX directory or Windows folder – it is a container for other objects. An HDF5 dataset is an array structure for storing data and has a rich set of types available for defining the elements in an HDF dataset array. They include simple scalars (e.g., integers and floats), fixed- and variable-length structures (e.g., strings), and "compound datatypes." A compound datatype is a record structure, whose fields can be any other datatype, including other compound types. Datasets and groups can contain attributes, which are simple name/value pairs for storing metadata. Finally, groups can be linked to any dataset or group in an HDF file, making it possible to show relationships among objects. These HDF objects allowed us to create an HDF file whose structure matched the structure and content of the Gene Model structure. Although the content of a Gene Model file is quite extensive and complex, the grouping and dataset structure of HDF makes it very easy to see the overall organization of the experiment. Since all the pieces are in one place, an application, or someone browsing the file, can easily find and access the particular data of interest. The figure to the left shows the HDF5 file structure we used. The ovals represent groups and the rectangles represent datasets. The grayed out groups (Genotyping, Expression, Proteomics) were not implemented.

Accessing the Data - HDF5's feasibility, and several advantageous features, are demonstrated by a screen shot obtained using HDFView, a cross platform Java-based application, that can be used view data stored in HDF5. This image below highlights the ability of HDF5, and HDF5-supporting technologies to meet the following requirements:

  • Support combining a large number of files
  • Provide simple navigation and access to data objects
  • Support data analysis
  • Integrate diverse data types

The left most panel of the screen (below) presents an "explorer" view of four HDF5 files (HapMap, LD, ADRB2, and FVIII), with their accompanying groups and datasets. Today, researchers store these data in separate files scattered throughout file systems. To share results with a colleague, they e-mail multiple spreadsheets or tab-delimited text files for each table of data. When all of the sequence data, basecall tables, assemblies, and genotype information are considered, the number of files becomes significant. For ADRB2 we combined the data from 309 individual files into a single HDF5 file. For FVIII, a genotyping study involving 39 amplicons and 137 patient samples, this number grows to more than 60,000 primary and versioned copied files.

With HDF5 these data are encapsulated in a single file thus simplifying data management in increasing data portability.

Example screen from the prototype demo. HDFView, a JAVA viewer for HDF5 files can display multiple HDF5 files and for each file, the structure of the data in the file. Datasets can be shown as tables, line plots, histograms and images. This example shows a HapMap dataset, LD calculations for a region of chromosome 22 and the data from two resequencing projects. The HapMap dataset (upper left) is a 52,636-row table of alleles from chromosome 22. Below it is an LD plot from the same chromosome. The resequencing projects, adrb2 and factor 8, show reference data and sequencing data. The table (middle) is a subsection of the poly table obtained from Phred. Using the line plot feature in HDFView, sub sections of the table were graphed. The upper right graph compares the called base peak areas (black line, top) to the uncalled peak areas (red, bottom) for the entire trace. The middle right graph highlights the region between 250 and 300 bases. The large peak at position 36 (position 286 in the center table, and top graph) represents a mixed base. The lower right graph is a "SNPshot" showing the trace data surrounding the variation.

In addition to reducing file management complexity, HDF5 and HDFView have a number of data analysis features that make it possible to deliver research-quality applications quickly. In the ADRB2 case, the middle table in the screen shot is a section of one of the basecall tables produced by Phred using its "-d" option. This table was opened by selecting the parent table and defining the columns and region to display. As this is done via HDF5's API, it is easy to envision a program "pulling" relevant slices of data from a data object, performing calculations from the data slices and storing the information back as a table that can be viewed from the interface. Also important, the data in this example are accessible and not "locked" away in a proprietary system. Not only is HDF an open format, HDFView allows one to export data as HDF subsets, images, and tab delimited tables. HDFView's copy functions allow one to easily copy data into other programs like Excel.

HDFView can produce basic line graphs that can be used immediately for data analysis, such as the two that are shown here for ADRB2. The two plots, in the upper right corner of the screen show the areas of the peaks for called (black, upper line) and uncalled (red, lower line) bases. The polymorphic base can be seen in the top plot as a spike in the secondary peak area. The lower graph contains the same data, plotted from the region between base 250 and 300 of the read. This plot shows a high secondary peak with a concomitant reduction in the primary peak area. One of PolyPhred's criteria for identifying a heterozygous base, that primary and secondary peak areas are similar, is easily observed. The significance of this demonstration is that HDF5 and HDFView have significant potential in algorithm development, because they can provide rapid access to different views of the data.

More significantly HDFView was used without any modifications demonstrating the benefit of a standard implementation system like HDF5.

Combining Diverse Data - The screen shot also shows the ability of HDF5 to combine and present diverse data types. Data from a single file containing both SNP discovery projects are shown, in addition to HapMap data (chromosome 22) and an LD plot consisting of a 1000 x 1000 array of LD values from a region of chromosome 22.

As we worked on this project, we became more aware of the technology limitations that hinder work with the HapMap and very large LD datasets and concluded that the HapMap data would provide an excellent test case for evaluating the ability of HDF5 to handle extremely large datasets.

For this test, a chromosome 22 genotype dataset was obtained from HapMap.org. Uncompressed, this is a large file (~24MB), consisting of a row of header information followed by approximately 52,000 rows of genotyped data. As a text file, it is virtually indecipherable and needs to be parsed and converted to a tabular data structure before the data can be made useful. When one considers that even for the smallest chromosome, the dataset is close to Microsoft Excel's (2003, 2004) row limit (65,535), the barrier to entry for the average biologist wishing to the use this data is quite high.

To put the data into HDF5 we made an XML schema of the structure to understand the model, built a parser, and loaded the data. As can be seen in HDFView (Fig. 6), the data were successfully converted from a difficult-to-read text form to a well-structured table where all of
the data can be displayed. At this point, HDFView can be used to extract and analyze subsets of information.

Next, we asked if HDF5 could be used to observe long range LD at the chromosome level. This is an important question that cannot be answered by current technology. Using the r2 algorithm, we computed the LD values for the 53,000 SNPs in chromosome 22 and produced a 53,000 x 53,000 array of values. These data would require 5.2 Gigabytes using a conventional file format. Since most of the values in this array are "0", the file can compress quite well. However, with conventional "gzip" methods, it must be uncompressed in order to be displayed, even if one wants only to display a small part of the entire image. Not only does such an operation take a long time, but common computer configurations lack sufficient memory to store such a large uncompressed image.

The LD test demonstrates the power of HDF5's collection of sophisticated storage structures. We have seen that we can compress datasets inside an HDF5 file, but we also see that compressing an entire dataset creates access problems. HDF5 deals with this problem through a storage feature called "chunking." Whereas most file formats store an array as a contiguous stream of bytes, chunking in HDF5 involves dividing a dataset into equal-sized "chunks" that are stored and accessed separately.

LD plot for chromosome 22. A 1000 x 1000 point array of LD calculations is shown. The table in the upper right shows the LD data for a very small region of the plot, demonstrating HDF's ability to allow one to easily select "slices" of datasets.

Chunking has many important benefits, two of which apply particularly to large LD arrays. Chunking makes it possible to achieve good performance when accessing subsets of the datasets, even when the chosen subset is orthogonal to the normal storage order of the dataset. If a very large LD array is stored contiguously, most sub-setting operations require a large number of accesses, as data elements with spatial locality can be stored a long way from one another on a disk, requiring many seeks and reads. With chunking, spatial locality is preserved in the storage structure, resulting in faster access to subsets. When a data subset is accessed, only the chunks that contain specific portions of the data need to be un-compressed. The extra memory and time required to uncompress the entire LD array are both avoided.

Using both chunking and compression, HDF5 compressed the data in the chromosome 22 LD array from 5.2 gigabytes to 300 megabytes, a 17-fold decrease in storage space. Furthermore, the LD array was then immediately available for viewing in HDFView, where it was converted to an image with color intensity used to show higher linkage. The display also shows a table of LD values corresponding to a subset of the larger LD plot. In HDFView, one can "box" select a region of pixels from an image and use it to create a subset of data. This is an important feature, as it will be impossible to view an entire chromosome LD plot at single pixel resolution. Thus, a matrix of lower resolution regions will need to be created and viewed in HDFView. The lower resolution image can highlight regions of high LD and, using a tool like HDFView, one can then select those regions and drill down into the underlying data.

HDF5 has many practical benefits for bioinformatics. As a standardized file technology data models can be implemented and tools like HDFView can be used to quickly visualize the organization of the data and results of computation. Computational scientists can develop new algorithms faster because they do not have to invest time developing new formats and new GUIs to view their data. The community can benefit because data become more portable. Finally, HDF is well suited for enhancing application performance through data compression, chunking, and memory mapping. Many of these features will become extremely valuable as Next Gen technologies push the volumes of data to higher and higher levels.

Labels: , , ,