Sunday, March 2, 2008

Genotyping with HDF

To continue our progress describing HDF and its value in bioinformatics, I present the work Geospiza and THG performed in developing a prototype application for genotyping. In this project we implemented a data model, based on polyPhred, in HDF5 to demonstrate HDF5's data organization capabilities. We further demonstrated HDF5's strengths for compressing and accessing data by adding HapMap genotype data sets and data from chromosome level linkage disequilibrium calculations.

Organizing Data - A resequencing project begins with a region of a genome, a gene or set of genes that will be studied. A researcher will have sets of samples from patient populations from which they will isolate DNA. PCR is used to amplify specific regions of DNA so that both chromosomal copies can be sequenced. The read data, obtained from chromatograms, are aligned to other reads and also to a reference sequence. Quality and trace information are used to predict whether the differences observed between reads and reference data are meaningful. The general data organization can be broken into a main part called the Gene Model and within it, two sub organizations: the reference and experimental data. The reference consists of the known state of information. Resequencing, by definition, focuses on comparing new data with a reference model. The reference model organizes all of the reference data including the reference sequence and annotations.

The sub organizations of data can be stored in HDF5 using its standard features. The two primary objects in HDF5 are "groups" and "datasets." The HDF5 group object is akin to a UNIX directory or Windows folder – it is a container for other objects. An HDF5 dataset is an array structure for storing data and has a rich set of types available for defining the elements in an HDF dataset array. They include simple scalars (e.g., integers and floats), fixed- and variable-length structures (e.g., strings), and "compound datatypes." A compound datatype is a record structure, whose fields can be any other datatype, including other compound types. Datasets and groups can contain attributes, which are simple name/value pairs for storing metadata. Finally, groups can be linked to any dataset or group in an HDF file, making it possible to show relationships among objects. These HDF objects allowed us to create an HDF file whose structure matched the structure and content of the Gene Model structure. Although the content of a Gene Model file is quite extensive and complex, the grouping and dataset structure of HDF makes it very easy to see the overall organization of the experiment. Since all the pieces are in one place, an application, or someone browsing the file, can easily find and access the particular data of interest. The figure to the left shows the HDF5 file structure we used. The ovals represent groups and the rectangles represent datasets. The grayed out groups (Genotyping, Expression, Proteomics) were not implemented.

Accessing the Data - HDF5's feasibility, and several advantageous features, are demonstrated by a screen shot obtained using HDFView, a cross platform Java-based application, that can be used view data stored in HDF5. This image below highlights the ability of HDF5, and HDF5-supporting technologies to meet the following requirements:

  • Support combining a large number of files
  • Provide simple navigation and access to data objects
  • Support data analysis
  • Integrate diverse data types

The left most panel of the screen (below) presents an "explorer" view of four HDF5 files (HapMap, LD, ADRB2, and FVIII), with their accompanying groups and datasets. Today, researchers store these data in separate files scattered throughout file systems. To share results with a colleague, they e-mail multiple spreadsheets or tab-delimited text files for each table of data. When all of the sequence data, basecall tables, assemblies, and genotype information are considered, the number of files becomes significant. For ADRB2 we combined the data from 309 individual files into a single HDF5 file. For FVIII, a genotyping study involving 39 amplicons and 137 patient samples, this number grows to more than 60,000 primary and versioned copied files.

With HDF5 these data are encapsulated in a single file thus simplifying data management in increasing data portability.

Example screen from the prototype demo. HDFView, a JAVA viewer for HDF5 files can display multiple HDF5 files and for each file, the structure of the data in the file. Datasets can be shown as tables, line plots, histograms and images. This example shows a HapMap dataset, LD calculations for a region of chromosome 22 and the data from two resequencing projects. The HapMap dataset (upper left) is a 52,636-row table of alleles from chromosome 22. Below it is an LD plot from the same chromosome. The resequencing projects, adrb2 and factor 8, show reference data and sequencing data. The table (middle) is a subsection of the poly table obtained from Phred. Using the line plot feature in HDFView, sub sections of the table were graphed. The upper right graph compares the called base peak areas (black line, top) to the uncalled peak areas (red, bottom) for the entire trace. The middle right graph highlights the region between 250 and 300 bases. The large peak at position 36 (position 286 in the center table, and top graph) represents a mixed base. The lower right graph is a "SNPshot" showing the trace data surrounding the variation.

In addition to reducing file management complexity, HDF5 and HDFView have a number of data analysis features that make it possible to deliver research-quality applications quickly. In the ADRB2 case, the middle table in the screen shot is a section of one of the basecall tables produced by Phred using its "-d" option. This table was opened by selecting the parent table and defining the columns and region to display. As this is done via HDF5's API, it is easy to envision a program "pulling" relevant slices of data from a data object, performing calculations from the data slices and storing the information back as a table that can be viewed from the interface. Also important, the data in this example are accessible and not "locked" away in a proprietary system. Not only is HDF an open format, HDFView allows one to export data as HDF subsets, images, and tab delimited tables. HDFView's copy functions allow one to easily copy data into other programs like Excel.

HDFView can produce basic line graphs that can be used immediately for data analysis, such as the two that are shown here for ADRB2. The two plots, in the upper right corner of the screen show the areas of the peaks for called (black, upper line) and uncalled (red, lower line) bases. The polymorphic base can be seen in the top plot as a spike in the secondary peak area. The lower graph contains the same data, plotted from the region between base 250 and 300 of the read. This plot shows a high secondary peak with a concomitant reduction in the primary peak area. One of PolyPhred's criteria for identifying a heterozygous base, that primary and secondary peak areas are similar, is easily observed. The significance of this demonstration is that HDF5 and HDFView have significant potential in algorithm development, because they can provide rapid access to different views of the data.

More significantly HDFView was used without any modifications demonstrating the benefit of a standard implementation system like HDF5.

Combining Diverse Data - The screen shot also shows the ability of HDF5 to combine and present diverse data types. Data from a single file containing both SNP discovery projects are shown, in addition to HapMap data (chromosome 22) and an LD plot consisting of a 1000 x 1000 array of LD values from a region of chromosome 22.

As we worked on this project, we became more aware of the technology limitations that hinder work with the HapMap and very large LD datasets and concluded that the HapMap data would provide an excellent test case for evaluating the ability of HDF5 to handle extremely large datasets.

For this test, a chromosome 22 genotype dataset was obtained from HapMap.org. Uncompressed, this is a large file (~24MB), consisting of a row of header information followed by approximately 52,000 rows of genotyped data. As a text file, it is virtually indecipherable and needs to be parsed and converted to a tabular data structure before the data can be made useful. When one considers that even for the smallest chromosome, the dataset is close to Microsoft Excel's (2003, 2004) row limit (65,535), the barrier to entry for the average biologist wishing to the use this data is quite high.

To put the data into HDF5 we made an XML schema of the structure to understand the model, built a parser, and loaded the data. As can be seen in HDFView (Fig. 6), the data were successfully converted from a difficult-to-read text form to a well-structured table where all of
the data can be displayed. At this point, HDFView can be used to extract and analyze subsets of information.

Next, we asked if HDF5 could be used to observe long range LD at the chromosome level. This is an important question that cannot be answered by current technology. Using the r2 algorithm, we computed the LD values for the 53,000 SNPs in chromosome 22 and produced a 53,000 x 53,000 array of values. These data would require 5.2 Gigabytes using a conventional file format. Since most of the values in this array are "0", the file can compress quite well. However, with conventional "gzip" methods, it must be uncompressed in order to be displayed, even if one wants only to display a small part of the entire image. Not only does such an operation take a long time, but common computer configurations lack sufficient memory to store such a large uncompressed image.

The LD test demonstrates the power of HDF5's collection of sophisticated storage structures. We have seen that we can compress datasets inside an HDF5 file, but we also see that compressing an entire dataset creates access problems. HDF5 deals with this problem through a storage feature called "chunking." Whereas most file formats store an array as a contiguous stream of bytes, chunking in HDF5 involves dividing a dataset into equal-sized "chunks" that are stored and accessed separately.

LD plot for chromosome 22. A 1000 x 1000 point array of LD calculations is shown. The table in the upper right shows the LD data for a very small region of the plot, demonstrating HDF's ability to allow one to easily select "slices" of datasets.

Chunking has many important benefits, two of which apply particularly to large LD arrays. Chunking makes it possible to achieve good performance when accessing subsets of the datasets, even when the chosen subset is orthogonal to the normal storage order of the dataset. If a very large LD array is stored contiguously, most sub-setting operations require a large number of accesses, as data elements with spatial locality can be stored a long way from one another on a disk, requiring many seeks and reads. With chunking, spatial locality is preserved in the storage structure, resulting in faster access to subsets. When a data subset is accessed, only the chunks that contain specific portions of the data need to be un-compressed. The extra memory and time required to uncompress the entire LD array are both avoided.

Using both chunking and compression, HDF5 compressed the data in the chromosome 22 LD array from 5.2 gigabytes to 300 megabytes, a 17-fold decrease in storage space. Furthermore, the LD array was then immediately available for viewing in HDFView, where it was converted to an image with color intensity used to show higher linkage. The display also shows a table of LD values corresponding to a subset of the larger LD plot. In HDFView, one can "box" select a region of pixels from an image and use it to create a subset of data. This is an important feature, as it will be impossible to view an entire chromosome LD plot at single pixel resolution. Thus, a matrix of lower resolution regions will need to be created and viewed in HDFView. The lower resolution image can highlight regions of high LD and, using a tool like HDFView, one can then select those regions and drill down into the underlying data.

HDF5 has many practical benefits for bioinformatics. As a standardized file technology data models can be implemented and tools like HDFView can be used to quickly visualize the organization of the data and results of computation. Computational scientists can develop new algorithms faster because they do not have to invest time developing new formats and new GUIs to view their data. The community can benefit because data become more portable. Finally, HDF is well suited for enhancing application performance through data compression, chunking, and memory mapping. Many of these features will become extremely valuable as Next Gen technologies push the volumes of data to higher and higher levels.

Labels: , , ,

Tuesday, February 26, 2008

The case for HDF

As we contemplate working Next Gen data we need to think about how we are going to store data and information efficiently. You might ask, what does that mean? Common, a file is a file isn't it?

Wrong

When one considers ways to work with data on a computer two problems must be solved: The first is how to structure the data. This is also known as defining the data model or format. The second is how to implement the data model. The data model describes data entities, the attributes of an entity (time, length, type) and the relationship between entities. The implementation is how the software programs and users will interact with the data (text files, binary files, relational databases, object databases, Excel, ...) to access data, perform calculations, and create information. Almost any kind of implementation can be used to solve any kind of problem, but in general, each type of problem has a limited optimal implementations. The factors that affect implementation choices include ease of use, scalability (time, space, and complexity), and application requirements (reads, writes, data persistence, updates ...). The rapid increases in the volumes of data being collected with current and future Next Gen sequencing technologies have significant data scalability and complexity issues and solutions like HDF become attractive. To understand why let's first look at some of the common data handling methods and discuss their advantages and disadvantages for working with data.

Text is easy
The most common implementation is text. Text formats dominate bioinformatics applications, for good reason. Text is human readable, can represent numbers as accurately as may be desired, and text is compatible with common utilities (e.g., grep), text editors, and languages like Perl. However, using text for high volume, complex data is inefficient and problematic. Computations must first convert data from text to binary, and, compared to binary data, it takes nearly three times as much space to represent an integer using text. Further, text files cannot represent complex data structures and relationships in ways that are easy to navigate computationally. Since scientific data are hierarchical in nature, systems that rely on text-based files often have multiple redundant methods for storing information. Text files lack random access; an object in a text file can be found only by reading the file from the beginning. Hence, almost all text file applications require that the entire file be read into memory, which seriously limits the practical size that a text file can be. Finally, the ease of creating new text-based formats leads to a number of obscure formats that proliferate rapidly, resulting in an interoperability nightmare requiring translators, that can import data from other applications, accompany almost every application.

XML must be better
The next step up in text is XML. XML is gaining popularity as a powerful data description language that uses standard tags to define the structure and the content of a file. XML can be used to store data, is very good at describing data and relationships, and works well if the amount of data is small. However, since XML is text-based, it has the same shortcomings as text files and is unsuitable for large data volumes.

How about a database?
For many bioinformatics applications, commercial and open-source database management systems (DBMSs) provide easy and efficient access to data; scientific tools submit declarative queries to the DBMS, which optimizes and executes the queries. Although DBMSs are excellent for many uses, they are often not sufficient for applications involving complex high volume data. Traditional DBMSs require significant performance tuning for scientific applications, a task that scientists are neither well prepared for, nor eager to address. Since most scientific data are write-once, and read-only thereafter, traditional relational concurrency control and logging mechanisms can add substantially to computing overhead. High-end database software, such as software for parallel systems, is especially expensive, and is not available for the leading-edge parallel machines popular in scientific computing centers. Most importantly, the complex data structures and data types needed to represent the data and relationships are not
supported well in most databases.

Let's go Bi
Binary formats are efficient for random data reading and writing, computation, storing numeric values, and data organization. These formats are used to store a significant amount of the data generated by analytical instruments and data created by desktop applications. Many of these formats however, are either proprietary or not publicly documented, limiting access to the data. This problem is often addressed in an unsatisfactory way through time-expensive, and error-prone, reverse engineering endeavors that can violate license agreements.

Next Gen needs a different approach
Indeed, the Next Gen community is arriving at the need to use binary file formats to improve data storage, access, and computational efficiency. The Short (Sequence) Read Format (SRF) is an example where a data format and binary file implementation are being used to develop efficient read storing methods. Popular algorithms such as Maq are also converting text files to binary files prior to computation to improve data access efficiency.

Wouldn't it be nice if we had a common binary format to implement our data models?

Why HDF?
For the reasons outlined above, Geospiza felt it would be worthwhile to explore general purpose, open source, binary file storage technologies, and looked to other scientific communities to learn how similar problems were being addressed. That search identified HDF (hierarchical data format) as a candidate technology. Initially developed in 1988 for storing scientific data, HDF is well established in many scientific fields and bioinformatics applications utilizing HDF can benefit from its long history and an infrastructure of existing tools.

Geospiza, together with The HDF Group, conducted a feasibility study to examine whether or not HDF would be helpful for addressing the data management problems of large volumes and complex biological data. The first test case looked at both issues by working with a large volume, highly complex, system for DNA sequencing-based SNP discovery. Through this study, HDF's strengths and data organization features (groups, sets, multidimensional arrays, transformations, linking objects, and general data storage for other binary data types and images) were evaluated to determine how well these features would handle SNP data. In addition to the proposed feasibility project with SNP discovery, other test cases were added, in collaboration with the NCBI, to test the ability of HDF to handle extremely large datasets. These addressed working with HapMap data and performing chromosomal scale LD (Linkage Disequilibrium) calculations.

That story is next ...

Labels: , , ,