Tuesday, February 23, 2010

GeneSifter Lab Edition v3.14 - Release Notes

GeneSifter Laboratory Edition (GSLE) 3.14.0 introduces a host of new features and capabilities that make daily laboratory data management work even easier.  Read below to learn why GSLE is a leading LIMS product for all forms of DNA sequencing, microarrays, and other genetic analysis applications.

Orders and Invoices

Multi-plate submissions: Order forms have been extended in several ways to further simplify how labs collect sample and project information. A new order form template lets core facilities that manage larger sequencing projects easily receive samples and their information in a multiple-plate format. New order fields specific to the plate format are included to support sample tracking and lab work.

Add data to fields: Order forms have been further improved with the ability to add new values (or terms) to dropdown fields that already exist on published order forms.


Project field: Additionally, labs can add an optional project field to forms. With these improvements, labs can create forms that are easier to use and modify, as well as enable project tracking for their customers.

Sample location and sample selection: Two new features deliver help for labs that provide sample storage (biobanking) services to their clients. First, order forms can include sample location information. This is particularly useful in situations where samples are delivered in 96-well plates that are stored for later use. Second, samples already stored by the lab as purified DNA, RNA or other material (templates) can be selected from specialized search interfaces within order forms. Like all GSLE sample entry forms, these features can be included or not on a case-by-case basis depending on your specific needs. 

Invoice formatting: For labs that have the dreaded chore of sending billing data to accounting departments, we have added the ability to modify the invoice number format to include additional characters that distinguish which lab is sending the information.

Laboratory Operations


GSLE provides the ability to create, list and follow steps in sample protocols (also called workflows). In 3.14 new features not only expand the capabilities but make it possible to further standardize procedures. 


Multiplexing: In Next Generation Sequencing (NGS) several libraries are often combined into a single lane or region of a slide to increase the number of individual samples analyzed in a sequencing run. As each library is prepared, a specific adaptor sequence is added so sequence reads corresponding to different samples can be identified by their adaptor tag. This procedure, called multiplexing or barcoding, is supported in 3.14 and allows the lab to combine samples and adaptor sequences and group the combination of libraries together (Worksets) for sample processing and instrument runs. Once data are collected, sample naming conventions, combined with adaptor sequence (Multiplex Identifier, MID) stored in sample sheets, are used to separate individual reads into files corresponding to the samples that were in the original workset.
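To make the read-separation step concrete, here is a minimal Python sketch of demultiplexing by adaptor tag. It is not GSLE code: the sample sheet contents, the fixed tag length, the tag position at the start of the read, and the exact-match rule are all assumptions made for illustration.

```python
from collections import defaultdict

# Hypothetical sample sheet: MID (adaptor tag) sequence -> sample name.
SAMPLE_SHEET = {
    "ACGT": "sample_A",
    "TGCA": "sample_B",
}

def demultiplex(reads, sample_sheet, tag_length=4):
    """Group reads by the adaptor tag found at the start of each read.

    Assumes a fixed-length tag at the start of the read and exact matching.
    The tag is trimmed from assigned reads; reads whose tag is not in the
    sample sheet are kept whole in an "unassigned" bin.
    """
    bins = defaultdict(list)
    for read in reads:
        tag, insert = read[:tag_length], read[tag_length:]
        if tag in sample_sheet:
            bins[sample_sheet[tag]].append(insert)
        else:
            bins["unassigned"].append(read)
    return dict(bins)

reads = ["ACGTTTGACC", "TGCAGGATTA", "ACGTCCCGTA", "GGGGAAAATT"]
by_sample = demultiplex(reads, SAMPLE_SHEET)
for sample, sample_reads in by_sample.items():
    print(sample, sample_reads)
```

In practice each bin would be written to its own file, and a production tool would likely tolerate a mismatch or two in the tag rather than require an exact match.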

Batch data entry: Some lab processes require that samples are manipulated in groups (batches), but laboratory data are collected for individual samples within the batch. For example, the concentrations of individual DNA samples may need to be measured in a 96-well plate. To improve how the OD values, comments, or other information are entered, workflow steps have been updated to include batch data entry forms that provide spreadsheet-like data entry capabilities. Like all GSLE batch data entry forms, data can be entered easily using the form’s column highlight and easy fill controls, or uploaded from an Excel spreadsheet.

Subsample processing: GSLE 3.14 also increases sample processing flexibility. As noted above, order forms can now support the ability to select samples that are already stored in the system. This feature is further extended into the laboratory with tools that allow many new samples to be created from a “parent” or stock sample. When new samples (templates) are created, options are provided so that each new sample can be entered into a different process. For example, suppose you receive a tissue sample that needs several experiments performed: RNA-Seq, ChIP-Seq, and resequencing. Now you can easily pick the sample and create three new subsamples, defining which process will be performed on each, with just a few clicks.

Selecting samples based on custom data: Some labs need to use custom data entered into order forms to sort and filter samples in the lab. For example, an order form may ask a researcher to enter read lengths for their NGS run. A 36 base run is much faster than a 100 base run, and on some platforms costs less. Thus, the lab will sort samples based on read length prior to the data collection event. While it has always been possible to get this information in many GSLE displays, 3.14 adds new capabilities to use any custom data in its specialized sample picker tools.

Other Features

Customer data management: GSLE v3.14 gives labs’ customers increased ability to organize their chromatograms, fragment analysis files and microarray files as needed. Data files can be edited, relabeled, moved or deleted. Projects and folders can be created, modified or deleted to aid in data organization.

Application Programming Interface (Onsite Installations Only)

SQL-API: As automation and system integration needs increase, requirements for supporting programmatic data entry become more important. GSLE has continued to expand the self-documenting Application Programming Interface (API). We have also added an SQL API that can be used to create custom reports that are accessed via a wget-style Unix command.


Input API enhancements: The Input API now returns success IDs, and CGI parameter names have been eliminated. For full documentation, contact support@geospiza.com to request the GSLE SQL API Manual or the GSLE Input API Manual.


Next Generation Analysis Transfer Tool (Hosted Partners Only)

Simplified data transfers: A data transfer interface has been added to connect GSLE and GeneSifter Analysis Edition (GSAE). Partner Program administrators use the interface to select data files in GSLE and transfer them to their customer’s account in GSAE.

Schema Table update note


There was an update to an existing schema table;  the column "Plate_Label" is now in table om_sample_plate instead of om_order.


Thursday, October 8, 2009

Resequencing and Cancer

Yesterday we released news about new funding from NIH for a project to work on ways to improve how variations between DNA sequences are detected using Next Generation Sequencing (NGS) technology. The project emphasizes detecting rare variation events to improve cancer diagnostics, but the work will support a diverse range of resequencing applications.

Why is this important?

In October 2008, the U.S. News and World Report published an article by Bernadine Healy, former head of NIH. The tag line “understanding the genetic underpinnings of cancer is a giant step toward personalized medicine,” (1) underscores how the popular press views the promise of recent advances in genomics technology in general, and the progress toward understanding the molecular basis of cancer. In the article, Healy presents a scenario where, in 2040, a 45-year-old woman, who has never smoked, develops lung cancer. She undergoes outpatient surgery, and her doctors quickly scrutinize the tumor’s genes and use a desktop computer to analyze the tumor genomes, and medical records to create a treatment plan. She is treated, the cancer recedes and subsequent checkups are conducted to monitor tumor recurrence. Should a tumor be detected, her doctor would quickly analyze the DNA of a few of the shed tumor cells and prescribe a suitable next round of therapy. The patient lives a long happy life, and keeps her hair.

This vision of successive treatments based on genomic information is not unrealistic, claims Healy, because we have learned that while many cancers can look homogeneous in terms of morphology and malignancy they are indeed highly complex and varied when examined at the genetic level. The disease of cancer is in reality a collection of heterogeneous diseases that, even for common tissues like the prostate, can vary significantly in terms of onset and severity. Thus, it is often the case that cancer treatments, based on tissue type, fail, leaving patients to undergo a long painful process of trial and error therapies with multiple severely toxic compounds.

Because cancer is a disease of genomic alterations, understanding the sources, causes, and kinds of mutations, their connection to specific types of cancer, and how they may predict tumor growth is worthwhile. The human cancer genome project (2) and initiatives like the international cancer genome consortium (3) have demonstrated this concept. The kinds of mutations found in tumor populations, thus far by NGS, include single nucleotide polymorphisms (SNPs), insertions and deletions, and small structural copy number variations (CNVs) (4, 5). From early studies it is clear that a greater amount of genomic information will be needed to make Healy's scenario a reality. Next generation sequencing (NGS) technologies will drive this next phase of research and enable our deeper understanding.

Project Synopsis

The great potential for the clinical applications of new DNA sequencing technologies comes from their highly sensitive ability to assess genetic variation. However, to make these technologies clinically feasible, we must assay patient samples at far higher rates than can be done with current NGS procedures. Today, experiments applying NGS in cancer research have investigated small numbers of samples in great detail, in some cases comparing entire genomes from tumor and normal cells from a single patient (6-8). These experiments show that when a region is sequenced with sufficient coverage, numerous mutations can be identified.

To move NGS technologies into clinical use many costs must decrease. Two ways costs can be lowered are to increase sample density and reduce the number of reads needed per sample. Because cost is a function of turnaround time and read coverage, and read coverage is a function of the signal-to-noise ratio, assays with higher background noise, due to errors in the data, will require higher sampling rates to detect true variation and will be more expensive. To put this in context, future cancer diagnostic assays will likely need to look at over 4000 exons per test. In cases like bladder cancer, or cancers where stool or blood are sampled, non-invasive tests will need to detect variations in one out of 1000 cells. Thus it is extremely important that we understand signal-to-noise ratios and are able to calculate read depth reliably.

Currently we have a limited understanding of how many reads are needed to detect a given rare mutation. Detecting mutations depends on a combination of sequencing accuracy and depth of coverage. The signal (true mutations) to noise (false mutations, hidden mutations) depends on how many times we see a correct result. Sequencing accuracy is affected by multiple factors that include sample preparation, sequence context, sequencing chemistry, instrument accuracy, and basecalling software. The current depth-of-coverage calculations are based on an assumption that sampling is random, which is not valid in the real world. Corrections will have to be applied to adjust for real-world sampling biases that affect read recovery and sequencing error rates (9-11).
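As a rough illustration of the idealized depth-of-coverage calculation, the Python sketch below treats each read as an independent draw that carries the variant with a fixed probability (a binomial model). That is exactly the random-sampling assumption noted above as unrealistic, so the numbers should be read as optimistic lower bounds. The function names, the 95% detection target, and the rule of "at least k supporting reads" as a stand-in for rising above the error background are assumptions for this example.

```python
import math

def prob_at_least(k, depth, fraction):
    """P(at least k variant reads) when each of `depth` reads independently
    carries the variant with probability `fraction` (binomial sampling)."""
    p_fewer = sum(
        math.comb(depth, i) * fraction**i * (1 - fraction) ** (depth - i)
        for i in range(k)
    )
    return 1.0 - p_fewer

def min_depth(fraction, k=1, power=0.95):
    """Smallest read depth at which P(at least k variant reads) >= power."""
    depth = k
    while prob_at_least(k, depth, fraction) < power:
        depth += 1
    return depth

# Variant present in one out of 1000 sampled molecules.
depth_one_read = min_depth(0.001, k=1)
depth_five_reads = min_depth(0.001, k=5)
print(depth_one_read, depth_five_reads)
```

Even under these ideal conditions, seeing a one-in-1000 variant just once with 95% probability takes roughly 3,000 reads at that position, and requiring several supporting reads to separate signal from sequencing errors pushes the depth considerably higher.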

Developing clinical software systems that can work with NGS technologies, to quickly and accurately detect rare mutations, requires a deep understanding of the factors that affect NGS data collection and interpretation. This information needs to be integrated into decision control systems that can, through a combination of computation and graphical displays, automate and aid a clinician’s ability to verify and validate results. Developing such systems is a major undertaking involving a combination of research and development in the areas of laboratory experimentation, computational biology, and software development.

Positioned for Success

Detecting small genetic changes in clinical samples is ambitious. Fortunately, Geospiza has the right products to deliver on the goals of the research. GeneSifter Lab Edition handles the details of setting up a lab, managing its users, storing and processing data, and making data and reports available to end users through web-based interfaces. The laboratory workflow system and flexible interfaces provide the centralized tools needed to track samples, their metadata, and experimental information. The data management and analysis server make the system scalable through a distributed architecture. Combined with GeneSifter Analysis Edition, a complete platform is created to rapidly prototype new data analysis workflows needed to test new analysis methods, experiment with new data representations, and iteratively develop data models to integrate results with experimental details.

References:

Press Release: Geospiza Awarded SBIR Grant for Software Systems for Detecting Rare Mutations

1. Healy, 2008. "Breaking Cancer's Gene Code - US News and World Report" http://health.usnews.com/articles/health/cancer/2008/10/23/breaking-cancers-gene-code_print.htm

2. Working Group, 2005. "Recommendation for a Human Cancer Genome Project" http://www.genome.gov/Pages/About/NACHGR/May2005NACHGRAgenda/ReportoftheWorkingGrouponBiomedicalTechnology.pdf

3. ICGC, 2008. "International Cancer Genome Consortium - Goals, Structure, Policies & Guidelines - April 2008" http://www.icgc.org/icgc_document/

4. Jones S., et al., 2008. "Core Signaling Pathways in Human Pancreatic Cancers Revealed by Global Genomic Analyses." Science 321, 1801.

5. Parsons D.W., et al., 2008. "An Integrated Genomic Analysis of Human Glioblastoma Multiforme." Science 321, 1807.

6. Campbell P.J., et al., 2008. "Subclonal phylogenetic structures in cancer revealed by ultra-deep sequencing." Proc Natl Acad Sci U S A 105, 13081-13086.

7. Greenman C., et al., 2007. "Patterns of somatic mutation in human cancer genomes." Nature 446, 153-158.

8. Ley T.J., et al., 2008. "DNA sequencing of a cytogenetically normal acute myeloid leukaemia genome." Nature 456, 66-72.

9. Craig D.W., et al., 2008. "Identification of genetic variants using bar-coded multiplexed sequencing." Nat Methods 5, 887-893.

10. Ennis P.D., et al., 1990. "Rapid cloning of HLA-A,B cDNA by using the polymerase chain reaction: frequency and nature of errors produced in amplification." Proc Natl Acad Sci U S A 87, 2833-2837.

11. Reiss J., et al., 1990. "The effect of replication errors on the mismatch analysis of PCR-amplified DNA." Nucleic Acids Res 18, 973-978.


Tuesday, October 6, 2009

From Blue Gene to Blue Genome? Big Blue Jumps In with DNA Transistors

Today, IBM announced that they are getting into the DNA sequencing business and the race for the $1,000 genome by winning a research grant to explore new sequencing technology based on nanopore devices they call DNA transistors.

IBM news travels fast. Genome Web and The Daily Scan covered the high points and Genetic Future presented a skeptical analysis of the news. You can read the original news at the IBM site, and enjoy a couple of videos.

A NY Times article listed a couple of facts that I thought were interesting: First, IBM becomes the 17th company to pursue the next next-gen (or third-generation) technology. Second, according to George Church, in the past five years the cost of collecting DNA sequence data has decreased 10-fold annually and is expected to continue decreasing at a similar pace for the next few years.

But what does this all mean?

It is clear from this and other news that DNA sequencing is fast becoming a standard way to study genomes, gene expression, and measure genetic variation. It is also clear that while the cost of DNA sequencing is decreasing at a fast rate, the amount of data being produced is increasing at a similarly fast rate.

While some of the articles above discussed the technical hurdles nanopore sequencing must overcome, none discussed the real challenges researchers face today with using the data. The fact is, for most groups, the current next-gen sequencers are underutilized because the volumes of data combined with the complexity of data analysis have created a significant bioinformatics bottleneck.

Fortunately, Geospiza is clearing data analysis barriers by delivering access to systems that provide standard ways of working with the data and visualizing results. For many NGS applications, groups can upload their data to our servers, align reads to reference data sources, and compare the resulting output across multiple samples in efficient and cost-effective processes.

And, because we are meeting the data analysis challenges for all of the current NGS platforms, we'll be ready for whatever comes next.


Saturday, September 12, 2009

Sneak Peek: Sequencing the Transcriptome: RNA Applications for Next Generation Sequencing

Join us this coming Wednesday, September 16, 2009 10:00 am Pacific Daylight Time (San Francisco, GMT-07:00), for a webinar on whole transcriptome analysis. In the presentation you will learn about how GeneSifter Analysis Edition can be used to identify novel RNAs and novel splice events within known RNAs.

Abstract:

Next Generation Sequencing applications such as RNA-Seq, Tag Profiling, Whole Transcriptome Sequencing and Small RNA Analysis allow whole genome analysis of coding as well as non-coding RNA at an unprecedented level. Current technologies allow for the generation of 200 million data points in a single instrument run. In addition to allowing for the complete characterization of all known RNAs in a sample, these applications are also ideal for the identification of novel RNAs and novel splicing events for known RNAs.

This presentation will provide an overview of the RNA applications using data from the NCBI's GEO database and Short Read Archive with an emphasis on converting raw data into biologically meaningful datasets. Data analysis examples will focus on methods for identifying differentially expressed genes, novel genes, differential splicing and 5’ and 3’ variation in miRNAs.

To register, please visit the event page.


Sunday, July 12, 2009

Bloginar: Scalable Bioinformatics Infrastructures with BioHDF. Part V: Why HDF5?

Through the course of this BioHDF bloginar series, we have demonstrated how the HDF5 (Hierarchical Data Format) platform can successfully meet current and future data management challenges posed by Next Generation Sequencing (NGS) technologies. We now close the series by discussing the reasons why we chose HDF5.

For previous posts, see:

  1. The introduction
  2. Project background
  3. Challenges of working with NGS data
  4. HDF5 benefits for working with NGS data

Why HDF5?

As previously discussed, HDF technology is designed for working with large amounts of complex data that naturally organize into multidimensional arrays. These data are composed of discrete numeric values, strings of characters, images, documents, and other kinds of data that must be compared in different ways to extract scientific information and meaning. Software applications that work with such data must meet a variety of organization, computation, and performance requirements to support the communities of researchers where they are used.

When software developers build applications for the scientific community, they must decide between creating new file formats and software tools for working with the data, or adapting existing solutions that already meet a general set of requirements. The advantage of developing software that's specific to an application domain is that highly optimized systems can be created. However, this advantage can disappear when significant amounts of development time are needed to deal with the "low-level" functions of structuring files, indexing data, tracking bits and bytes, making the system portable across different computer architectures, and creating a basic set of tools to work with the data. Moreover, such a system would be unique, with only a small set of users and developers able to understand and share knowledge concerning its use.

The alternative to building a highly optimized domain-specific application system is to find and adapt existing technologies, with a preference for those that are widely used. Such systems benefit from the insights and perspective of many users and will often have features in place before one even realizes they are needed. If a technology has widespread adoption, there will likely be a support group and knowledge base to learn from. Finally, it is best to choose a solution that has been tested by time. Longevity is a good measure of the robustness of the various parts and tools in the system.

HDF: 20 Years in Physical Sciences

Our requirements for a high-performance data management and computation system are these:

  1. Different kinds of data need to be stored and accessed.
  2. The system must be able to organize data in different ways.
  3. Data will be stored in different combinations.
  4. Visualization and computational tools will access data quickly and randomly.
  5. Data storage must be scalable, efficient, and portable across computer platforms.
  6. The data model must be self describing and accessible to software tools.
  7. Software used to work with the data must be robust, and widely used.

HDF5 is a natural fit. The file format and software libraries are used in some of the largest data management projects known to date. Because of its strengths, HDF5 is independently finding its way into other bioinformatics applications and is a good choice for developing software to support NGS.

HDF5 software provides a common infrastructure that allows different scientific communities to build specific tools and applications. Applications using HDF5 typically contain three parts: one or more HDF5 files to store data, a library of software routines to access the data, and the tools, applications and additional libraries to carry out functions that are specific to a particular domain. To implement an HDF5-based application, a data model must be developed along with application-specific tools such as user interfaces and unique visualizations. While implementation can be a lot of work in its own right, the tools to implement the model and provide scalable, high-performance programmatic access to the data have already been developed, debugged, and delivered through the HDF I/O (input/output) library.

In earlier posts, we presented examples where we needed to write software to parse FASTA-formatted sequence files and output files from alignment programs. These parsers then called routines in the HDF I/O library to add data to the HDF5 file. During the import phase, we could set different compression levels and define the chunk size to compress our data and optimize access times. In these cases, we developed a simple data model based on the alignment output from programs like BWA, Bowtie, and MapReads. Most importantly, we were able to work with NGS data from multiple platforms efficiently, with software that required weeks of development rather than the months and years that would be needed if the system were built from scratch.
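To show why compression level and chunk size matter together, here is a small Python sketch of the general idea behind chunked, compressed storage. It deliberately uses only the standard library's zlib module and is not the HDF5 I/O library or its file layout; the function names and the toy sequence are assumptions for illustration.

```python
import zlib

def store_chunked(sequence, chunk_size, level=6):
    """Split a sequence into fixed-size chunks and compress each one
    independently at the given zlib compression level."""
    data = sequence.encode("ascii")
    return [
        zlib.compress(data[i:i + chunk_size], level)
        for i in range(0, len(data), chunk_size)
    ]

def read_slice(chunks, chunk_size, start, stop):
    """Return sequence[start:stop], decompressing only the chunks that
    overlap the requested range."""
    if start >= stop:
        return ""
    first, last = start // chunk_size, (stop - 1) // chunk_size
    data = b"".join(zlib.decompress(c) for c in chunks[first:last + 1])
    offset = first * chunk_size
    return data[start - offset:stop - offset].decode("ascii")

genome = "ACGT" * 2500  # 10,000 bases of toy data
chunks = store_chunked(genome, chunk_size=1000, level=9)
region = read_slice(chunks, 1000, 1995, 2005)  # spans two chunks
print(region, len(chunks), sum(len(c) for c in chunks))
```

The trade-off is the one described above: larger chunks usually compress better, while smaller chunks mean less data has to be decompressed to answer a query about a small region.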

While HDF5 technology is powerful "out-of-the-box," a number of features can still be added to make it better for bioinformatics applications. The BioHDF project is about making such domain-specific extensions. These are expected to include modifications to the general file format to better support variable data like DNA sequences. I/O library extensions will be created to help HDF5 "speak" bioinformatics by creating APIs (Application Programming Interfaces) that understand our data. Finally, sets of command line programs and other tools will be created to help bioinformatics groups get started quickly with using the technology.

To summarize, the HDF5 platform is well-suited for supporting NGS data management and analysis applications. Using this technology, groups will be able to make their data more portable for sharing because the data model and data storage are separated from the implementation of the model in the application system. HDF5's flexibility in the kinds of data it can store makes it easier to integrate data from a wide variety of sources. Integrated compression utilities and data chunking make HDF5-based systems as scalable as they can be. Finally, because the HDF5 I/O library is extensive and robust, and the HDF5 tool kit includes basic command-line and GUI tools, a platform is provided that allows for rapid prototyping and reduced development time, thus making it easier to create new approaches for NGS data management and analysis.

For more information, or if you are interested in collaborating on the BioHDF project, please feel free to contact me (todd at geospiza.com).


Monday, July 6, 2009

Bloginar: Scalable Bioinformatics Infrastructures with BioHDF. Part IV: HDF5 Benefits

Now that we're back from Alaska and done with the 4th of July fireworks, it's time to present the next installment of our series on BioHDF.

HDF highlights
HDF technology is designed for working with large amounts of scientific data and is well suited for Next Generation Sequencing (NGS). Scientific data are characterized by very large datasets that contain discrete numeric values, images, and other data, collected over time from different samples and locations. These data naturally organize into multidimensional arrays. To obtain scientific information and knowledge, we combine these complex datasets in different ways and (or) compare them to other data using multiple computational tools. One difficulty that plagues this work is that the software applications and systems for organizing the data, comparing datasets, and visualizing the results are complicated, resource intensive, and challenging to develop. Many of the development and implementation challenges can be overcome using the HDF5 file format and software library.

Previous posts have covered:
1. An introduction
2. Background of the project
3. Complexities of NGS data analysis and performance advantages offered by the HDF platform.

HDF5 changes how we approach NGS data analysis.

As previously discussed, the NGS data analysis workflow can be broken into three phases. In the first phase (primary analysis) images are converted into short strings of bases. Next, the bases, represented individually or encoded as di-bases (SOLiD), are aligned to reference sequences (secondary analysis) to create derivative data types such as contigs or annotated tables of alignments, that are further analyzed (tertiary analysis) in comparative ways. Quantitative analysis applications, like gene expression, compare the results of secondary analyses between individual samples to measure gene expression and identify mRNA isoforms, or make other observations based on a sample’s origin or treatment.

The alignment phase of the data analysis workflow creates the core information. During this phase, reads are aligned to multiple kinds of reference data to understand sample and data quality, and obtain biological information. The general approach is to align reads to sets of sequence libraries (reference data). Each library contains a set of sequences that are annotated and organized to provide specific information.

Quality control measures can be added at this point. One way to measure data quality is to ask how many reads were obtained from constructs without inserts. Aligning the read data to a set of primers (individually and joined in different ways) that were used in the experiment allows us to measure the number of reads that match and how well they match. A higher quality dataset will have a larger proportion of sequences matching our sample and a smaller proportion of sequences that only match the primers. Similarly, different biological questions can be asked using libraries constructed of sequences that have biological meaning.

Aligning reads to sequence libraries is the easy part. The challenge is analyzing the alignments. Because the read datasets in NGS assays are large, organizing alignment data into forms we can query is hard. The problem is simplified by setting up multistage alignment processes as a set of filters. That is, reads that match one library are excluded from the next alignment. Differential questions are then asked by counting the numbers of reads that match each library. With this approach, each set of alignments is independent of the other alignments and a program only needs to analyze one set of alignments at a time. Filter-based alignment is also used to distinguish reads with perfect matches from those with one or more mismatches.
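A filter cascade of this kind can be sketched in a few lines of Python. This is only a toy: exact substring matching stands in for a real aligner, and the library names, sequences, and reads are invented for the example.

```python
def matches(read, library):
    """Stand-in for an aligner: a read 'hits' a library if it is an
    exact substring of any sequence in that library."""
    return any(read in seq for seq in library)

def filter_cascade(reads, libraries):
    """Align reads to each library in order. Reads that hit a library are
    counted for it and removed before the next alignment stage."""
    counts = {}
    remaining = list(reads)
    for name, library in libraries:
        hits = [r for r in remaining if matches(r, library)]
        counts[name] = len(hits)
        remaining = [r for r in remaining if r not in hits]
    counts["unmatched"] = len(remaining)
    return counts

# Hypothetical libraries, listed in the order the filters are applied.
libraries = [
    ("primers", ["AAAACCCC"]),
    ("rRNA", ["GGGGTTTTGGGG"]),
    ("transcripts", ["ACGTACGTTTTGGA", "CCCCGATTACA"]),
]
reads = ["AACC", "TTTG", "GATT", "CCCC", "NNNN"]
counts = filter_cascade(reads, libraries)
print(counts)
```

Note that the read "CCCC" is counted only under primers, even though it also occurs in one of the transcript sequences, because it never reaches the later stages.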

Still, filter-based alignment approaches have several problems. When new sequence libraries are introduced, the entire multistage alignment process must be repeated to update results. Next, information about reads that have multiple matches in different libraries, or perfect matches and non-perfect matches within a library, is lost. Finally, because alignment formats between programs differ and good methods for organizing alignment data do not exist, it is hard to compare alignments between multiple samples. This last issue also creates challenges for linking alignments to the original sequence data and formatting information for other tools.

As previously noted, solving the above problems requires that alignment data be organized in ways that facilitate computation. HDF5 provides the foundation to organize and store both read and alignment data to enable different kinds of data comparisons. This ability is demonstrated by the following two examples.

In the first example (left), reads from different sequencing platforms (SOLiD, Illumina, 454) were stored in HDF5. Illumina RNA-Seq reads from three different samples were aligned to the human genome and annotations from a UCSC GFF (general feature format) file were applied to define gene boundaries. The example shows the alignment data organized into three HDF5 files, one per sample, but in reality the data could have been stored in a single file or files organized in other ways. One of HDF's strengths is that the HDF5 I/O library can query multiple files as if they were a single file, providing the ability to create the high-level data organizations that are the most appropriate for a particular application or use case. With reads and alignments structured in these files, it is a simple matter to integrate data to view base (color) compositions for reads from different sequencing platforms, compare alternative splicing between samples, and select a subset of alignments from a specific genomic region, or gene, in a "wig" format for viewing in a tool like the UCSC genome browser.

The second example (right) focuses on how organizing alignment data in HDF5 can change how differential alignment problems are approached. When data are organized according to a model that defines granularity and relationships, it becomes easier to compute all alignments between reads and multiple reference sources than to think about how to perform differential alignments and implement the process. In this case, a set of reads (obtained from cDNA) are aligned to primers, QC data (ribosomal RNA [rRNA] and mitochondrial DNA [mtDNA]), miRBase, refseq transcripts, the human genome, and a library of exon junctions. During alignment up to three mismatches are tolerated between a read and its hit. Alignment data are stored in HDF5 and, because the data were not filtered, a greater variety of questions can be asked. Subtractive questions mimic the differential pipeline where alignments are used to filter reads from subsequent steps. At the same time, we can also ask "biological" questions about the number of reads that came from rRNA or mtDNA or from genes in the genome or exon junctions. And for these questions, we can examine the match quality between each read and its matching sequence in the reference data sources, without having to reprocess the same data multiple times.
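The "align everything, then ask questions" idea can be sketched in Python as follows. A plain list of records stands in for the alignment data stored in HDF5, a sliding-window mismatch count stands in for the aligner, and the libraries, reads, and helper names are all hypothetical. The toy reads are only six bases long, so the example tolerates one mismatch rather than three.

```python
def best_hit(read, library, max_mismatches):
    """Fewest mismatches between the read and any same-length window of any
    library sequence, or None if no window is within max_mismatches."""
    best = None
    for seq in library:
        for i in range(len(seq) - len(read) + 1):
            mm = sum(a != b for a, b in zip(read, seq[i:i + len(read)]))
            if mm <= max_mismatches and (best is None or mm < best):
                best = mm
    return best

def align_all(reads, libraries, max_mismatches=3):
    """Align every read to every library and keep all hits (no filtering)."""
    table = []
    for read in reads:
        for name, library in libraries.items():
            mm = best_hit(read, library, max_mismatches)
            if mm is not None:
                table.append({"read": read, "library": name, "mismatches": mm})
    return table

def reads_hitting(table, library):
    """'Biological' question: which reads hit this library at all?"""
    return {row["read"] for row in table if row["library"] == library}

def subtract(table, keep, exclude):
    """Subtractive question: reads that hit `keep` but none of `exclude`."""
    excluded = set()
    for name in exclude:
        excluded |= reads_hitting(table, name)
    return reads_hitting(table, keep) - excluded

libraries = {
    "primers": ["AAAACCCCGG"],
    "rRNA": ["GGGGTTTT"],
    "transcripts": ["CCCCGATTACA"],
}
reads = ["AACCCC", "GGGTTA", "CCCCGA", "TTTTTT"]
table = align_all(reads, libraries, max_mismatches=1)

from_transcripts = reads_hitting(table, "transcripts")
after_filtering = subtract(table, "transcripts", ["primers", "rRNA"])
perfect = {}
for row in table:
    if row["mismatches"] == 0:
        perfect[row["library"]] = perfect.get(row["library"], 0) + 1
print(from_transcripts, after_filtering, perfect)
```

Here the read "CCCCGA" matches a transcript perfectly and a primer with one mismatch. The unfiltered table can report it as a transcript read, reproduce the filtered answer that drops it, and show the match quality of both hits, all from one alignment pass.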

The above examples demonstrate the benefits of being able to organize data into structures that are amenable to computation. When data are properly structured, new approaches that expand the ways in which data are analyzed can be implemented. HDF5 and its library of software routines move the development process from activities associated with optimizing the low level infrastructures needed to support such systems to designing and testing different data models and exploiting their features.

The final post of this series will cover why we chose to work with HDF5 technology.


Tuesday, April 21, 2009

What if dbEST was an NGS Experiment? Part I: dbEST

Back in 1997, this alarming statement appeared in a paper [1]:

“Biological research is generating data at an explosive rate. Nucleotide sequence databases alone are growing at a rate of >210 million base pairs (bp)/year and it has been estimated that if the present rate of growth continues, by the end of the millennium the sequence databases will have grown to 4 billion bp!” [emphasis mine]

Imagine 4 billion bp of data - what would we do with all that?


The article was about the defunct Merck Gene Index browser, which was developed to make massive numbers of cDNA sequences, also called Expressed Sequence Tags (ESTs), available through a web-based system. The ESTs were being generated through the Merck Gene Index Project, which was one of many public and private projects focused on collecting EST and full length cDNA sequences from human and model organism samples. The goal of these projects was to create data resources of transcript sequences for studying gene expression and later finding genes in genomic sequence data. Combined, these projects cost tens of millions of dollars and spanned nearly a decade. They also produced millions of ESTs that are now stored in NCBI’s dbEST database [2].

And the prediction of GenBank’s growth was close: release 115 of GenBank (Dec, 1999) had 4.6 billion bases. With the most recent release, 9 years later, GenBank has grown to over 103 billion bases, and some would say we are just getting started with sequencing DNA [3].

Today, for a few thousand dollars, a single run of an Illumina, SOLiD, or Helicos instrument can collect a greater amount of data than has ever been produced from all the EST projects combined. This raises the question: what would the data look like if dbEST was a Next Generation Sequencing (NGS) experiment?

A Brief History of dbEST

Before we get into comparing dbEST to a Next Generation DNA Sequencing (NGS) experiment, we should discuss what dbEST is and how it came to be. In the early days of automated DNA sequencing (ca. 1990) it was realized that cDNA, reverse transcribed from mRNA, could be partially sequenced and the resulting data could be used to measure which genes are expressed in a cell or tissue. The term EST was coined to describe the fact that each sequence corresponded to an mRNA molecule, and was in effect a “tag” for that molecule [4]. EST stands for Expressed Sequence Tag.

During the early years, EST sequencing was controversial. Many proponents of the genome project felt that collecting ESTs would obviate the need for sequencing the entire genome and that Congress would end funding for the genome project before it was complete. Further controversy arose when NIH decided to patent several of the early brain ESTs. This news created an uproar in the community and led to the famous statement by one Nobel laureate that automated sequencing machines “could be run by monkeys [5].”

ESTs also led to the founding of dbEST [2], a valuable resource for quickly assessing the functional aspects of the genome and later for identifying and annotating genes within genomic sequences. Today, EST projects continue to be worthwhile endeavors for exploring new organisms before full genome sequencing can be performed.

In the 15+ years since the founding of dbEST, the database has grown from 22,537 entries to approximately 61 million (4/17/2009). The first dbEST report contained ESTs from seven organisms. Today, over 1700 organisms are represented in dbEST. The species with the highest numbers of ESTs (> 1,000,000) include human, mouse, corn, pig, Arabidopsis, cow, zebrafish, soybean, Xenopus, rice, Ciona, wheat, and rat. More than half of the species however, have fewer than 10,000 ESTs. Since January of this year dbEST has grown by more than 2,000,000 entries.

Despite its value, dbEST, like many resources at the NCBI, requires an “expert” level of understanding to be useful. As classical clone-based cDNA sequencing gives way to more cost effective higher throughput methods like NGS, less emphasis will be placed on making this resource useful beyond maintaining the data as an archival resource that the community can access.

What this means is that when you visit the site, it does not look like much is there. You can get links to the original (closed access) papers and learn about how many sequences are present for each organism. Accession numbers or gene names can be used to look up a sequence, and from other pages you can use BLAST to search the resource with a query sequence.

If you want to know more, you have to know how to look for the information and deal with it in the context in which it is presented. For example, I mentioned that dbEST has grown since January. I knew this because I looked at the list of organisms and numbers of sequences then and now and noticed that more are reported now. However, to tell you where numbers have increased for which organisms, or whether new organisms have been added, would require significant time and effort, by either saving the different release reports or digging through the dbEST ftp site. When we return to the story, we’ll do some "ftp archaeology" and dig through dbEST records to begin characterizing the human ESTs.


References:

1. Eckman B.A., Aaronson J.S., Borkowski J.A., Bailey W.J., Elliston K.O., Williamson A.R., Blevins R.A., 1998. The Merck Gene Index browser: an extensible data integration system for gene finding, gene characterization and EST data mining. Bioinformatics 14, 2-13.

2. Boguski M.S., Lowe T.M., Tolstoshev C.M., 1993. dbEST--database for “expressed sequence tags”. Nat Genet 4, 332-333. See also: http://www.ncbi.nlm.nih.gov/dbEST/

3. ftp://ftp.ncbi.nih.gov/genbank/gbrel.txt

4. Adams M.D., Kelley J.M., Gocayne J.D., Dubnick M., Polymeropoulos M.H., Xiao H., Merril C.R., Wu A., Olde B., Moreno R.F., 1991. Complementary DNA sequencing: expressed sequence tags and human genome project. Science 252, 1651-1656.
And http://www.genomenewsnetwork.org/resources/timeline/1991_Venter.php

5. http://www.nature.com/nature/journal/v405/n6790/full/405983b0.html


Wednesday, January 28, 2009

The Next Generation Dilemma: Large Scale Data Analysis

Next week is the AGBT genome conference in Marco Island, Florida. At the conference we will present a poster on work we have been doing with Next Gen Sequencing data analysis. In this post we present the abstract. We'll post the poster when we return from sunny Florida.

Abstract

The volumes of data that can be obtained from Next Generation DNA sequencing instruments make several new kinds of experiments possible and new questions amenable to study. The scale of subsequent analyses, however, presents a new kind of challenge. How do we get from a collection of several million short sequences of bases to genome-scale results? This process involves three stages of analysis that can be described as primary, secondary, and tertiary data analyses. At the first stage, primary data analysis, image data are converted to sequence data. In the middle stage, secondary data analysis, sequences are aligned to reference data to create application-specific data sets for each sample. In the final stage, tertiary data analysis, the data sets are compared to create experiment-specific results. Currently, the software for the primary analyses is provided by the instrument manufacturers and handled within the instrument itself, and when it comes to the tertiary analyses, many good tools already exist. However, between the primary and tertiary analyses lies a gap.

In RNA-Seq, the process of determining relative gene expression means that sequence data from multiple samples must go through the entire process of primary, secondary, and tertiary analysis. To do this work, researchers must puzzle through a diverse collection of early version algorithms that are combined into complicated workflows with steps producing complicated file formats. Command line tools such as MAQ, SOAP, MapReads, and BWA, have specialized requirements for formatted input and output and leave researchers with large data files that still require additional processing and formatting for tertiary analyses. Moreover, once reads are aligned, datasets need to be visualized and further refined for additional comparative analysis. We present a solution to these challenges that closes the gaps between primary, secondary, and tertiary analysis by showing results from a complete workflow system that includes data collection, processing and analysis for RNA-seq.

And, if you cannot be in sunny Florida, join us in Memphis where we will help kick off the ABRF conference with a workshop on Next Generation DNA Sequencing. I'm kicking the workshop off with a talk entitled "From Reads to Data Sets, Why Next Gen is Not Like Sanger Sequencing."


Wednesday, December 31, 2008

Closing 2008

As we bring 2008 to a close, it is a good time to reflect on our progress and think about the new year ahead. Despite the world economic news, both the genomics field in general and Geospiza specifically have many positive accomplishments to show for the year.

In February, we introduced FinchLab for Next Gen Sequencing at the AGBT and ABRF conferences. At these shows, it was clear that Next Gen Sequencing was going to change the ways we think about applying DNA sequencing to interrogate a multitude of genetic and functional genomics problems. Over the course of 2008, many papers were published demonstrating the value of the massively parallel sequencing technology. MassGenomics dubbed 2008: Year of the Cancer Genome. Other blogs are following suit with articles on personal genomics and other advancements provided largely through Next Gen Sequencing.

Throughout the year, we also learned that while you can do a lot with a huge amount of data, working with the data is extremely challenging. Conference presentations and editorials in journals frequently made this point. While many of these articles focused on the data management challenge, groups acquiring the technology were also learning that the challenges go beyond data management. Comprehensive software systems are needed to manage all facets of the process, from tracking how samples are prepared for specific experiments to how the data are stored and organized, to analyzing and presenting the data according to the experiment being performed. In short, we learned that Next Gen technologies produce sequence data in different ways and require that we think about DNA sequencing in new ways.

Geospiza’s Version 3 Software Platform and GeneSifter

To address these new challenges, and expand support for existing technologies, Geospiza accomplished two significant milestones in 2008. First, we released the third version of our software platform that supports both laboratory workflows and data analysis automation. Through this system, laboratories are able to set up different interfaces to collect experimental information, assign specific workflows to experiments, track the workflow steps in the laboratory, prepare samples for data collection runs, link data back to the original samples and process data according to the needs of the experiment - without any programming. More importantly, for those who want to develop data analysis pipelines, the system provides a deployable environment that lets you add new pipelines and make them easily accessible.

The second major milestone was our acquisition of GeneSifter. GeneSifter is an award-winning microarray data analysis product. With GeneSifter, Geospiza can deliver complete end-to-end systems for data intensive genetic analysis applications like microarrays and Next Gen sequencing based transcription. Also, GeneSifter, like Geospiza’s other products, is web-based and can be delivered as a Software as a Service (SaaS) product.

SaaS was one of the important themes for 2008. Geospiza understands well that data intensive science requires a significant IT (Information Technology) investment. Throughout 2008, we saw first-hand that groups building their own IT infrastructures were not only challenged by investing heavily in quickly depreciating hardware assets, they also experienced basic infrastructure challenges like having enough space, power, and cooling systems for the equipment. Once those problems were solved, there were the other challenges of getting systems set up and running, installing software, and having experienced people - and time - to maintain the infrastructure. SaaS solves those problems and offloads the burden of maintaining expensive infrastructures. For a number of groups, locally run systems are the right choice. However, it is a choice that should be carefully thought out and well-planned. In our experience, customers choosing the SaaS option were up and running quicker and at a lower cost than our customers who chose to build their own systems.

As we close 2008 and look forward to 2009, we want to especially thank our customers for their support and the interesting problems they have invited us to help solve.


Friday, December 12, 2008

Papers, Papers, and more Papers

Next Gen Sequencing is hot, hot, hot! You can tell by the numbers and frequency in which papers are being published.

A few posts ago, I wrote about a couple of grant proposals that we were preparing on methods to detect rare variants in cancer and improve the tools and methods to validate datasets from quantitative assays that utilize Next Gen data, like RNA-Seq, ChIP-Seq, or Other-Seq experiments. Besides the normal challenges of getting two proposals written and uploaded to the NIH, there was an additional challenge. Nearly every day, we opened the tables-of-contents in our e-mail and found new papers highlighting Next Gen Sequencing techniques, applications, or biological discoveries made through Next Gen techniques. To date, over 200 Next Gen publications have been produced. During the last two months alone, more than 30 papers have been published. Some of these (listed in the figure below) were relevant to the proposals we were drafting.

The papers highlighted many of the themes we've touched on here, including the advantages of Next Gen sequencing and the challenges of dealing with the data. As we are learning, these technologies allow us to explore the genome and genomics of systems biology at significantly higher resolutions than previously imagined. In one of the higher profile efforts, teams at the Washington University School of Medicine and Genome Center compared a leukemia genome to a normal genome using cells from the same patient. This first intra-person whole genome analysis identified acquired mutations in ten genes, eight of which were new. Interestingly, the eight genes have unknown functions and might be important some day for new therapies.

Next Gen technologies are also confirming that molecular biology is more complicated than we thought. For example, the four most recent papers in Science show us that not only is 90% of the genome actively transcribed, but many genes have both sense and anti-sense RNA expressed. It is speculated that the anti-sense transcripts have a role in regulating gene expression. Also, we are seeing that nearly every gene produces alternatively spliced transcripts. The most recent papers indicate that between 92% and 97% of transcripts are alternatively spliced. My guess is that the only genes not alternatively spliced are those lacking introns, like olfactory receptors. However, when alternative transcription starts and alternative polyadenylation sites are considered, we may see that all genes are processed in multiple ways. It will be interesting to see how the products of alternative splicing and anti-sense transcription might interact.

This work has a number of take home messages.
  1. Like astronomy, when we can see deeper we see more. Next Gen technologies are giving us the means to interrogate large collections of individual RNA or DNA molecules and speculate more on functional consequences.
  2. Our limits are our imaginations. The reported experiments have used a variety of creative approaches to study genomic variation, sample expressed molecules from different strands of DNA, and measure protein DNA/RNA interaction.
  3. Good hands do good science. As pointed out in the paper from the Sanger Center on their implementation of Next Gen sequencing, the processes are complex and technically demanding. You need to have good laboratory practices with strong informatics support for all phases (laboratory, data management, and data analysis) of the Next Gen sequencing processes.
The final point is very important and Geospiza’s lab management and data analysis products will simplify your efforts in getting Next Gen systems running to make your major investment pay off and quickly publish results.

To see how, join us for a webinar next Wednesday, Dec. 17 at 10 am PST, for RNA Expression Analysis with Geospiza.




Sunday, November 9, 2008

Next Gen-Omics

Advances in Next Gen technologies have led to a number of significant papers in recent months, highlighting their potential to advance our understanding of cancer and human genetics (1-3). These and the hundreds of other papers demonstrate the value of Next Gen sequencing. The work completed thus far has been significant, but much more needs to be done to make these new technologies useful for a broad range of applications. Experiments will get harder.

While much of the discussion in the press focuses on rapidly sequencing human genomes for low cost as part of the grail of personalized genomics (4), a vast amount of research must be performed at the systems level to fully understand the relationship between biochemical processes in a cell and how the instructions for the processes are encoded in the genome. Systems biology and a plethora of "omics" have emerged to measure multiple aspects of cell biology as DNA is transcribed into RNA and RNA translated into protein and proteins interact with molecules to carry out biochemistry.

As noted in the last post, we are developing proposals to further advance the state-of-the-art in working with Next Gen data sets. In one of those proposals, Geospiza will develop novel approaches to work with data from applications of Next Gen sequencing technologies that are being developed to study the omics of DNA transcription and gene expression.

Toward furthering our understanding of gene expression, Next Gen DNA sequencing is being used to perform quantitative assays where DNA sequences are used as highly informative data points. In these assays, large datasets of sequence reads are collected in a massively parallel format. Reads are aligned to reference data to obtain quantitative information by tabulating the frequency, positional information, and variation from the reads in the alignments. Data tables from samples that differ by experimental treatment, environment, or in populations, are compared in different ways to make discoveries and draw experimental conclusions. Recall the three phases of data analysis.

However, to be useful these data sets need to come from experiments that measure what we think they should measure. The data must be high quality and free of artifacts. In order to compare quantitative information between samples, the data sets must be refined and normalized so that biases introduced through sample processing are accounted for. Thus, a fundamental challenge to performing these kinds of experiments is working with the data sets that are produced. In this regard numerous challenges exist.
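As a rough illustration of why normalization matters, the sketch below tabulates reads per gene for two samples and scales the counts to reads per million. Reads-per-million scaling is just one simple choice that we are assuming here; it corrects only for sequencing depth, not for the other biases mentioned above. The gene names and read counts are invented.

```python
from collections import Counter

def count_reads_per_gene(aligned_genes):
    """Tabulate how many reads aligned to each gene."""
    return Counter(aligned_genes)

def reads_per_million(counts):
    """Scale raw counts by the total number of aligned reads in the sample."""
    total = sum(counts.values())
    if total == 0:
        return {}
    return {gene: n * 1_000_000 / total for gene, n in counts.items()}

# Invented data: each list entry is the gene a read aligned to.
sample_a = ["geneX"] * 30 + ["geneY"] * 10   # 40 aligned reads
sample_b = ["geneX"] * 30 + ["geneY"] * 50   # 80 aligned reads

rpm_a = reads_per_million(count_reads_per_gene(sample_a))
rpm_b = reads_per_million(count_reads_per_gene(sample_b))

print(rpm_a["geneX"], rpm_b["geneX"])
```

Both samples have 30 raw reads for geneX, yet sample B was sequenced twice as deeply, so its normalized value is half that of sample A. Comparing the raw tables directly would have hidden that difference.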

The obvious ones relating to data storage and bioinformatics are being identified in both the press and scientific literature (5,6). Other, less published, issues include a lack of:
  • standard methods and controls to verify datasets in the context of their experiments,
  • standardized ways to describe experimental information and
  • standardized quality metrics to compare measurements between experiments.
Moreover, data visualization tools and other user interfaces, if available, are primitive and significantly slow the pace at which a researcher can work with the data. Finally, information technology (IT) infrastructures that can integrate the system parts dealing with sample tracking, experimental data entry, data management, data processing and result presentation are incomplete.

We will tackle the above challenges by working with the community to develop new data analysis methods that can run independently and within Geospiza's FinchLab. FinchLab handles the details of setting up a lab, managing its users, storing and processing data, and making data and reports available to end users through web-based interfaces. The laboratory workflow system and flexible order interfaces provide the centralized tools needed to track samples, their metadata, and experimental information. Geospiza's hosted (Software as a Service [SaaS]) delivery models remove additional IT barriers.

FinchLab's data management and analysis server make the system scalable through a distributed architecture. The current implementation of the analysis server creates a complete platform to rapidly prototype new data analysis workflows and will allow us to quickly devise and execute feasibility tests, experiment with new data representations, and iteratively develop the needed data models to integrate results with experimental details.

References

1. Ley, T. J., Mardis, E. R., Ding, L., Fulton, B., et al. DNA sequencing of a cytogenetically normal acute myeloid leukaemia genome. Nature 456, 66-72 (2008).

2. Wang, J., Wang, W., Li, R., Li, Y., et al. The diploid genome sequence of an Asian individual. Nature 456, 60-65 (2008).

3. Bentley, D. R., Balasubramanian, S., Swerdlow, H. P., Smith, G. P., et al. Accurate whole human genome sequencing using reversible terminator chemistry. Nature 456, 53-59 (2008).

4. My genome. So what? Nature 456, 1 (2008).

5. Prepare for the deluge. Nature Biotechnology 26, 1099 (2008).

6. Byte-ing off more than you can chew. Nature Methods 5, 577 (2008).


Monday, October 6, 2008

Sneak Peek: Genetic Analysis From Capillary Electrophoresis to SOLiD

On October 7, 2008 Geospiza hosted a webinar featuring the FinchLab, the only software product to track the entire genetic analysis process, from sample preparation, through processing to analyzed results.

If you are as disappointed about missing it as we are about you missing it, no worries. You can get the presentation here.

If you are interested in:
  • Learning about Next Gen sequencing applications
  • Seeing what makes the Applied Biosystems SOLiD system powerful for transcriptome analysis, ChIP-Seq, resequencing experiments, and other applications
  • Understanding the flow of data and information as samples are converted into results
  • Overcoming the significant data management challenges that accompany Next Gen technologies
  • Setting up Next Gen sequencing in your core lab
  • Creating a new lab with Next Gen technologies
This webinar is for you!

In the webinar, we talked about the general applications of Next Gen sequencing and focused on using SOLiD to perform Digital Gene Expression experiments by highlighting mRNA Tag Profiling and whole transcriptome analysis. Throughout the talk we gave specific examples about collecting and analyzing SOLiD data and showed how the Geospiza FinchLab solves challenges related to laboratory setup and managing Next Gen data and analysis workflows.


Wednesday, July 30, 2008

BioHDF at BOSC

The scale of Next Gen sequencing is only going to increase, hence we need to fundamentally change the way we work with Next Gen data. New software systems with scalable data models, APIs, software tools, and viewers are needed to support the very large datasets used by the applications that analyze Next Gen DNA sequence data.

That was the theme of a talk I presented at the BOSC (Bioinformatics Open Source Conference) meeting that preceded ISMB (Intelligent Systems for Molecular Biology) in Toronto, Canada, July 19th. You can get the slides from the BOSC site. At the same time, we posted a blog on Genographia, a next-generation genomics community web site devoted to Next Gen sequencing discussions and idea sharing. The key points are summarized below.

Motivation

The BioHDF project is motivated by the fact that the next and future generations of data collection technologies, like DNA sequencing, are creating ever increasing amounts of data. Getting meaningful information from these data requires that multiple programs be used in complex processes. Current practices for working with these data create many kinds of challenges, ranging from managing large numbers of files and formats to having the computation power and bandwidth to make calculations and move data around. These practices have a high cost in terms of storage, CPU, and bandwidth efficiency. In addition, they require significant human effort in understanding idiosyncratic program behavior and output formats.

Is there a better way?

Many would agree that if we could reduce the number of file formats, avoid data duplication, and improve how we access and process data, we could develop better performing and more interoperable applications. Doing so requires improved ways of storing data and making it accessible to programs. For a number of years we have thought about how these goals might be accomplished and looked to other data-intensive fields to see how others have solved these problems. Our search ended when we found HDF (hierarchical data format), a standard file format and library used in the physical and earth sciences.

BioHDF

HDF5 can be used in many kinds of bioinformatics applications. For specialized areas, like DNA sequencing, domain specific extensions will be needed. BioHDF is about developing those extensions, through community support, to create a file format and accompanying library of software functions that are needed to build the scalable software applications of the future. More will follow. If you are interested, contact me: todd at geospiza.com.


Monday, July 14, 2008

Maq Attack

Maq (Mapping and Assembly with Quality) is an algorithm, developed at the Sanger center, for assembling Next Gen reads onto a reference sequence. Since Maq is widely used for working with Next Generation DNA sequence data, we chose to include support for Maq in our upcoming release of FinchLab. In this post, we will discuss integrating secondary analysis algorithms like Maq with the primary analysis and workflows in FinchLab.

Improving laboratory processes through immediate feedback

The cost to run Next Generation DNA sequencing instruments and the volume of data produced make it important for labs to be able to monitor their processes in real time. In the last post, I discussed how labs can get performance data and accomplish scientific goals during the three stages of data analysis. To quickly review: Primary data analysis involves converting image data to sequence data. Secondary data analysis involves aligning the sequences from the primary data analysis to reference data to create data sets that are used to develop scientific information. An example of a secondary analysis step would be assembling reads into contigs when new genomes are sequenced. Unlike the first two stages, where much of the data is used to detect errors and measure laboratory performance, the last stage is focused on the science. In tertiary data analysis, genomes are annotated and data sets are compared. Thus the tertiary analyses are often the most important in terms of gaining new insights. The data used in this phase must be vetted first. It must be high quality and free from systemic errors.

The companies producing Next Gen systems recognize the need to automate primary and secondary analysis. Consequently, they provide some basic algorithms along with the Next Gen instruments. Although these tools can help a lab get started, many labs have found that significant software development is needed on top of the starting tools if they are to fully automate their operation, translate output files into meaningful summaries, and give users easy access to the data. The starter kits from the instrument vendors can also be difficult to adapt when performing other kinds of experiments. Working with Next Gen systems typically means that you will have to deal with a lot of disconnected software, a lack of user interfaces, and diverse new choices for algorithms when it comes to getting your work done.

FinchLab and Maq in an integrated system

The Geospiza FinchLab integrates analytical algorithms such as Maq into a complete system that encompasses all the steps in genetic analysis. Our Samples to Results platform provides flexible data entry interfaces to track sample metadata. The laboratory information management system is user configurable so that any kind of genetic analysis procedure can be run and tracked and, most importantly, it provides tight linkage between samples, lab work, and their resulting data. This system makes it easy to transition high quality primary results to secondary data analysis.

One of the challenges with Next Gen sequencing has been choosing an algorithm for secondary analysis. Secondary data analysis needs to be adaptable to different technology platforms and algorithms for specialized sequencing applications. FinchLab meets this need because it can accommodate multiple algorithms when it comes to secondary and tertiary analysis. One of these algorithms is Maq. Maq is attractive because it can be used in diverse applications where reads are aligned to a reference sequence. Among these are Transcriptomics (Tag Profiling, EST analysis, small RNA discovery), Promoter Mapping (ChIP-Seq, DNase hypersensitivity), Methylation analysis, and Variation Analyses (SNP, CNV). Maq offers a rich set of output files, so it can be used to quickly provide an overview of your data and help you verify that your experiment is on track before you invest serious time in tertiary work. Finally, Maq is being actively developed and improved and is open-source, so it is easy to access and use regardless of affiliation.

Maq and other algorithms are integrated into FinchLab through the FinchLab Remote Analysis Server (RAS). RAS is a lightweight job tracking system that can be configured to run any kind of program in different computing environments. RAS communicates with FinchLab to get the data and return the results. Data analyses are run in FinchLab by selecting the sequence file(s), clicking a link to go to a page and select the analysis method(s) and reference data sets, and then clicking a button to start the work. RAS tracks the details of data processing and sends information back to FinchLab so that you can always see what is happening through the web interface.
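For readers who have not worked with a job tracking system, the following is a generic sketch of the idea: run an arbitrary command, record its status as it moves from queued to running to done or failed, and keep a short log. It is not RAS code and does not reflect how RAS is built. The `Job` class, the status names, and the stand-in commands are all hypothetical.

```python
import subprocess
import sys
from dataclasses import dataclass, field

@dataclass
class Job:
    """Hypothetical job record: a command to run plus its status and log."""
    name: str
    command: list
    status: str = "queued"
    log: list = field(default_factory=list)

def run_job(job):
    """Run one job, recording status changes and captured output."""
    job.status = "running"
    job.log.append("started")
    result = subprocess.run(job.command, capture_output=True, text=True)
    if result.stdout:
        job.log.append("stdout: " + result.stdout.strip())
    if result.stderr:
        job.log.append("stderr: " + result.stderr.strip())
    job.status = "done" if result.returncode == 0 else "failed"
    job.log.append(f"exit code {result.returncode}")
    return job

# Stand-in commands: any program could be substituted here.
ok = run_job(Job("align", [sys.executable, "-c", "print('aligned 3 reads')"]))
bad = run_job(Job("broken", [sys.executable, "-c", "import sys; sys.exit(2)"]))

print(ok.status, ok.log)
print(bad.status, bad.log)
```

A real tracker would also queue jobs, run them on remote machines, and report status back to a web interface. The sketch only shows the bookkeeping that makes a failed step visible.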

A basic FinchLab system includes the RAS and pipelines for running Maq in two ways. The first is Tag Profiling and Expression Analysis. In this operation, Maq output files are converted to gene lists with links to drill down into the data and NCBI references. The second option is to use Maq in a general analysis procedure where all the output files are made available. In the next months, new tools will convert more of these files into output that can be added to genome browsers and other tertiary analysis systems.

A final strength of RAS is that it produces different kinds of log files to track potential errors. These kinds of files are extremely valuable in troubleshooting and fixing problems. Since Next Gen technology is new and still in constant flux, you can be certain that unexpected issues will arise. Keeping the research on track is easier when informative RAS logging and reports help to diagnose and resolve issues quickly. Not only can FinchLab help with Next Gen assays and help solve those unexpected Next Gen problems, but multiple Next Gen algorithms can also be integrated into FinchLab to complete the story.


Wednesday, June 25, 2008

Finch 3: Getting Information Out of Your Data

Geospiza's tag line "From Sample to Results" represents the importance of capturing information from all steps in the laboratory process. Data volumes are important and lots of time is being spent discussing the overwhelming volumes of data produced by new data collection technologies like Next Gen sequencers. However, the real issue is not how you are going to store the data, rather it is what are you going to do with it? What do your data mean in the context of your experiment?

The Geospiza FinchLab software system supports the entire laboratory and data analysis workflow to convert sample information into results. What this means is that the system provides a complete set of web-based interfaces and an underlying database to enter information about samples and experiments, track sample preparation steps in the laboratory, link the resulting data back to samples, and process the data to get biological information. Previous posts have focused on information entry, laboratory workflows, and data linking. This post will focus on how data are processed to get biological information.

The ultra-high data output of Next Gen sequencers allows us to use DNA sequencing to ask many new kinds of questions about structural and nucleotide variation and to measure several indicators of expression and transcription control on a genome-wide scale. The data produced consist of images, signal intensity data, quality information, and DNA sequences with their quality values. For each data collection run, the total collection of data and files can be enormous and can require significant computing resources. While all of the data have to be dealt with in some fashion, some of the data have long-term value while other data are only needed in the short term. The final scientific results will often be produced by comparing data sets created from the DNA sequences and their comparison to reference data.

Next Gen data are processed in three phases.

Next Gen data workflows involve three distinct phases of work:

1. Data are collected from control and experimental samples.
2. Sequence data obtained from each sample are aligned to reference sequence data, or data sets, to produce aligned data sets.
3. Summaries of the alignment information from the aligned data sets are compared to produce scientific understanding.

Each phase has a discrete analytical process and we, and others, call these phases primary data analysis, secondary data analysis and tertiary data analysis.

Primary data analysis involves converting image data to sequence data. The sequence data can be in familiar "ACTG" sequence space or less familiar color space (SOLiD) or flow space (454). Primary data analysis is commonly performed by software provided by the data collection instrument vendor and it is the first place where quality assessment about a sequencing run takes place.

Secondary data analysis creates the data sets that will be further used to develop scientific information. This step involves aligning the sequences from the primary data analyses to reference data. Reference data can be complete genomes, subsets of genomic data like expressed genes, or individual chromosomes. Reference data are chosen in an application-specific manner, and sometimes multiple reference data sets will be used in an iterative fashion.

Secondary data analysis has two objectives. The first is to determine the quality of the DNA library that was sequenced, from a biological and sample perspective. The primary data analysis supplies quality measurements that can be used to determine if the instrument ran properly, or whether the density of beads or clusters was at its optimum to deliver the highest number of high quality reads. However, those data do not tell you about the quality of the samples. That requires answering questions about sample quality. Did the DNA library contain systematic artifacts such as sequence bias? Were there high numbers of ligated adaptors or incomplete restriction enzyme digests, or any other factors that would interfere with interpreting the data? These kinds of questions are addressed in the secondary data analysis by aligning your reads to the reference data and seeing that your data make sense.
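As a rough illustration of one sample-quality question, the sketch below estimates what fraction of reads carry adaptor sequence. It is a minimal example, not how FinchLab or Maq performs this check: the function name, the `min_overlap` parameter, the toy reads, and the adaptor string are all assumptions made for illustration, and real pipelines would usually tolerate mismatches.

```python
def adaptor_fraction(reads, adaptor, min_overlap=8):
    """Return the fraction of reads containing the first min_overlap bases of the adaptor.

    Exact substring matching only; this is a simplification for illustration.
    """
    if not reads:
        return 0.0
    seed = adaptor[:min_overlap].upper()
    hits = sum(1 for read in reads if seed in read.upper())
    return hits / len(reads)


# Toy reads and a hypothetical adaptor sequence, invented for this example.
reads = [
    "ACGTACGTAGATCGGAAGAGC",
    "TTTTGGGGCCCCAAAA",
    "GGGAGATCGGAAGCCC",
    "ACGTACGT",
]
fraction = adaptor_fraction(reads, "AGATCGGAAGAGC")
print(f"{fraction:.0%} of reads contain adaptor sequence")
```

A high fraction would suggest a library preparation problem worth investigating before moving on to tertiary analysis.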

The second objective of secondary data analysis is to prepare the data sets for tertiary analysis, where they will be compared in an experimental fashion. This step involves further manipulation of alignments, typically expressed in very large, hard-to-read, algorithm-specific tables, to produce data tables that can be consumed by additional software. Speaking of algorithms, there is a large and growing list to choose from. Some are general purpose and others are specific to particular applications. We'll comment more on that later.

Tertiary data analysis represents the third phase of the Next Gen workflow. This phase may involve a simple activity like viewing a data set in a tool like a genome browser so that the frequency of tags can be used to identify promoter sites, patterns of variation, or structural differences. In other experiments, like digital gene expression, tertiary analysis can involve comparing different data sets in a similar fashion to microarray experiments. These kinds of analyses are the most complex; expression measurements need to be normalized between data sets and statistical comparisons need to be made to assess differences.
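To make the normalization step concrete, here is a minimal sketch that scales tag counts to tags per million and computes a log2 ratio between two data sets. The function names, the pseudocount, and the gene counts are assumptions for illustration; this is one simple normalization among many, and it omits the statistical testing a real expression analysis needs.

```python
import math


def tags_per_million(counts):
    """Scale raw tag counts so that each data set sums to one million."""
    total = sum(counts.values())
    if total == 0:
        return {gene: 0.0 for gene in counts}
    return {gene: n * 1_000_000 / total for gene, n in counts.items()}


def log2_ratios(control, experiment, pseudocount=1.0):
    """Return log2(experiment / control) per gene after normalization.

    The pseudocount avoids division by zero for genes missing from one set.
    """
    c = tags_per_million(control)
    e = tags_per_million(experiment)
    genes = sorted(set(c) | set(e))
    return {
        g: math.log2((e.get(g, 0.0) + pseudocount) / (c.get(g, 0.0) + pseudocount))
        for g in genes
    }


# Hypothetical counts: the experiment was sequenced twice as deeply as the control.
control = {"geneA": 100, "geneB": 300}
experiment = {"geneA": 400, "geneB": 400}
ratios = log2_ratios(control, experiment)
for gene, ratio in ratios.items():
    print(gene, round(ratio, 2))
```

Without normalization, geneB would appear to have increased (300 to 400 tags), when its share of the library actually dropped.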

To summarize, the goal of primary and secondary analysis is to produce well-characterized data sets that can be further compared to obtain scientific results. Well-characterized means that the quality is good for both the run and the samples and that any biologically relevant artifacts are identified, limited, and understood. The workflows for these analyses involve many steps, multiple scientific algorithms, and numerous file formats. The choices of algorithms, data files, data file formats, and overall number of steps depend on the kinds of experiments and assays being performed. Despite this complexity, there are standard ways to work with Next Gen systems to understand what you have before progressing through each phase.

The Geospiza FinchLab system focuses on helping you with both primary and secondary data analysis.


Friday, June 13, 2008

Finch 3, Linking Samples and Data

One of the big challenges with Next Gen sequencing is linking sample information with data. People tell us: "It's a real problem." "We use Excel, but it is hard." "We're losing track."

Do you find it hard to connect sample information with all the different types of data files? If so you should look at FinchLab.

A review:

About a month ago, I started talking about our third version of the Finch platform and introduced the software requirements for running a modern lab. To review, labs today need software systems that allow them to:

1. Set up different interfaces to collect experimental information
2. Assign specific workflows to experiments
3. Track the workflow steps in the laboratory
4. Prepare samples for data collection runs
5. Link data from the runs back to the original samples
6. Process data according to the needs of the experiment

In FinchLab, order forms are used to first enter sample information into the system. They can be created for specific experiments and the samples entered will, most importantly, be linked to the data that are produced. The process is straightforward. Someone working with the lab, a customer or collaborator, selects the appropriate form and fills out the requested information. Later, an individual in the lab reviews the order and, if everything is okay, chooses the "processing" state from a menu. This action "moves" the samples into the lab where the work will be done. When the samples are ready for data collection they are added to an "Instrument run." The instrument run is Finch's way of tracking which samples go in what well of a plate or lane/chamber on a slide. The samples are added to the instrument and data are collected.

The data

Now comes the fun part. If you have a Next Gen system you'll ultimately end up with thousands of files scattered in multiple directories. The primary organization for the data will be in Unix-style directories, which are like Mac or Windows folders. Within the directories you will find a mix of sequence files, quality files, files that contain information about run metrics, and possibly images. You'll have to make decisions about what to save for long-term use and what to archive or delete.

As noted, the instrument software organizes the data by the instrument run. However, a run can have multiple samples, and the samples can be from different experiments. A single sample can be spread over multiple lanes and chambers of a slide. If you are running a core lab, the samples will come from different customers and your customers often belong to different lab groups. And there is the analysis. The programs that operate on the data require specific formats for input files and produce many kinds of output files. Your challenge is to organize the data so that it is easy to find and access in a logical way. So what do you do?

Organizing data the hard way

If you do not have a data management system, you'll need to write down which samples go with which person, group or experiment. That's pretty simple. You can tape a piece of paper on the instrument and write this down, or you can diligently open a file, commonly an Excel spreadsheet, and record the info there. Not too bad, after all there are only a handful of partitions on a slide (2, 8, 16) and you only run the instrument once or twice a week. If you never upgrade your instrument, or never try and push too many samples through, then you're fine. Of course the less you run your instrument the more your data cost and the goal is to get really good at running your instrument, as frequently as possible. Otherwise you look bad at audit time.

Let's look at a scenario where the instrument is being run at maximal throughput. Over the course of a year, data from between 200 and 1000 slide lanes (chambers) may be collected. These data may be associated with hundreds or thousands of samples and belong to a few or many users in one or many lab groups. The relevant sequence files range from a few hundred megabytes to gigabytes in size; they exist in directories with run quality metrics and possibly analysis results. To sort this out you could have committee meetings to determine whether data should be organized by sample, experiment, user, or group, or you could just pick an organization. Once you've decided on your organization you have to set up access. Does everyone get a Unix account? Do you set up SAMBA services? Do you put the data on other systems like Macs and PCs? What if people want to share? The decisions and IT details are endless. Regardless, you'll need a battery of scripts to automate moving data around to meet your organizational scheme. Or you could do something easier.

Organizing data the Finch way

One of FinchLab's many strengths is how it organizes Next Gen data. Because the system tracks samples and users, and has group and permissions models, issues related to data access and sharing are simplified. After a run is complete, the system knows which data files go with which samples. It also knows which samples were submitted by each user. Thus data can be maintained in the run directories that were created by the instrument software to simplify file-based organization. When a run is complete in FinchLab a data link is made to the run directory. The data link informs the system which files go with a run. Data processing routines in the system sort the data into sequences, quality metric files, and other data. At this stage data are associated with samples. Once this is done, the lab has easy access to the data via web pages. The lab can also make decisions about access to data and how to analyze the data. These last two features make FinchLab a powerful system for core labs and research groups. With only a few clicks your data are organized by run, user, group, and experiment - and you didn't have to think about it.
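The general idea of leaving files in the run directory and indexing them by sample and user can be sketched in a few lines. This is not FinchLab's implementation: the `lane<N>` file naming convention, the function name, and the sample and user names are all hypothetical, chosen only to show how a lane-to-sample map removes the need to move files around.

```python
import re
from collections import defaultdict

# Hypothetical convention: lane-specific files contain "lane<N>" in their names.
LANE_PATTERN = re.compile(r"lane(\d+)")


def index_run(file_names, lane_to_sample, sample_to_user):
    """Group a run's files by sample and by user without moving any files."""
    by_sample = defaultdict(list)
    by_user = defaultdict(list)
    for name in file_names:
        match = LANE_PATTERN.search(name)
        if match is None:
            continue  # run-level file, not tied to a single lane
        sample = lane_to_sample.get(int(match.group(1)))
        if sample is None:
            continue  # lane was not assigned to a sample
        by_sample[sample].append(name)
        by_user[sample_to_user.get(sample, "unassigned")].append(name)
    return dict(by_sample), dict(by_user)


# Invented example data for one instrument run.
files = ["lane1_seq.txt", "lane1_qual.txt", "lane2_seq.txt", "run_summary.txt"]
lane_to_sample = {1: "S1", 2: "S2"}
sample_to_user = {"S1": "alice", "S2": "bob"}

by_sample, by_user = index_run(files, lane_to_sample, sample_to_user)
print(by_sample)
print(by_user)
```

Because the index is built from the sample sheet rather than from the directory layout, the same files can be presented by run, sample, or user without copying them.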




Thursday, June 5, 2008

Finishing in the Future

"The data sets are astronomical," "the data that needs to be attached to sequences is unbelievable," and "browsing [data] is incomprehensible." These are just three of the many quotes I heard about the challenges associated with DNA sequencing last week at the "Finishing in the Future Meeting" sponsored by the Joint Genome Institute (JGI) and Los Alamos National Laboratory (LANL).

Metagenomics

The two and a half day conference, focused on finishing genomic sequences, kicked off with a session on metagenomics. Metagenomics is about isolating DNA from environments and sequencing random molecules to "see what's out there." Excitement for metagenomics is being driven by Next Gen sequencing throughput, because so many sequences can be collected relatively inexpensively. A benefit of being able to collect such large data sets is that we can interrogate organisms that cannot be cultured. The first talk, "Defining the Human Microbiome: Friends or Family," was presented by Bruce Birren from the Broad Institute of MIT & Harvard. In this talk, we learned about the HMP (Human Microbiome Project), a project dedicated to characterizing the microbes that live on our bodies. It is estimated that microbial cells outnumber our cells by ten to one. It has long been speculated that our microbiomes are involved in our health and sickness, and recent studies are confirming these ideas.

Sequencing technologies continue to increase data throughput

The afternoon session opened with presentations from Roche (454), Illumina, and Applied Biosystems on their respective Next Gen sequencing platforms. Each company presented the strengths of their platform and new discoveries that are being made by virtue of having a lot of data. Each company also presented data on improvements designed to produce even more data and road maps for future improvement to produce even more data. As Haley Fiske from Illumina put it, "we're in the middle of an arms race!" Finally, all the companies are working on molecular barcodes, so that multiple samples can be analyzed within an experiment. So, we started with a lot of data from a sample and are going to a lot of data from a lot of samples. That should add some very nice complexity to sample and data tracking.

A unique perspective

Sydney Brenner opened the second day with a talk on "The Unfinished Genome." The thing I like most about a Sydney Brenner talk is how he puts ideas together. In this talk he presented how one could look at existing data and literature to figure things out or make new discoveries. In one example, he speculated on when the genes for eye development may have first appeared. From the physiology of the eye you can use the biochemistry of vision to identify the genes that encode the various proteins involved in the process. These proteins are often involved in other processes, but differ slightly. They arise from gene duplication and modification. So, you can look at gene duplications and measure the age of a duplication by looking at neighboring genes. If a duplication event is old, neighboring genes will be unequal distances apart. You can use this information, along with phylogenetic data, to estimate when the events occurred. Of course this kind of study benefits from more sequence data. Sydney encouraged everyone to keep sequencing.

Sydney closed his talk by making a fun analogy where genomics is like astronomy and thus should have been called "genomy." He supported his analogy by noting that astronomy has astrophysics and genomics has genetics. Both are quantitative and measure history and evolution. Astronomy also has astrology, the prediction of an individual's future from the stars. Similarly, folks would like to predict an individual's future from their genes, and he suggested we call this work "Genology," since it has the same kind of scientific foundation as astrology.

Challenges and solutions

The rest of the conference and posters focused on finishing projects. Today the genome centers are making use of all the platforms to generate large data sets and finish projects. A challenge for genomics is lowering finishing costs. The problem is that generating "draft" data has become so inexpensive and fast that finishing has become a significant bottleneck. Finishing is needed to produce the high quality reference sequences that will inform our genomic science, so investigating ways to lower finishing costs is a worthwhile endeavour. Genome centers are approaching this problem by looking at ways to mix data from different technologies such as 454 and Illumina or SOLiD. They are also developing new and mixed software approaches such as combining multiple assembly algorithms to improve alignments. These efforts are being conducted in conjunction with experiments where mixtures of single pass and paired read data sets are tested to determine optimal approaches for closing gaps.

The take home from this meeting is that, over the coming years, a multitude of new approaches and software programs will emerge to enable genome scale science. The current technology providers are aggressively working to increase data throughput, data quality and read length to make their platforms as flexible as possible. New technology providers are making progress on even higher throughput platforms. Computer scientists are working hard on new algorithms and data visualizations to handle the data. Molecular barcodes will allow for greater numbers of samples per data collection event and increase sample tracking complexity.

The bottom line

Individual research groups will continue to have increasing access to "genome center scale" technology. However, the challenges with sample tracking, data management, and data analysis will be daunting. Research groups with interesting problems will be cut off from these technologies unless they have access to cost-effective, robust informatics infrastructures. They will need help setting up their labs, organizing the data, and making use of new and emerging software technologies.

That's where Geospiza can help.
