Tuesday, April 13, 2010

Bloginar: Standardizing Bioinformatics with BioHDF (HDF5)

Yesterday we (The HDF Group and Geospiza) released the BioHDF prototype software.  To mark the occasion, and demonstrate some of BioHDF’s capabilities and advantages, I share the poster we presented at this year’s AGBT (Advances in Genome Biology and Technology) conference.

The following map guides the presentation. The poster has a title and four main sections, which cover background information, specific aspects of the general Next Generation Sequencing (NGS) workflow, and HDF5’s advantages for working with large amounts of NGS data.
 
Section 1.  The first section introduces HDF5 (Hierarchical Data Format) as a software platform for working with scientific data.  The introduction begins with the abstract and lists five specific challenges created by NGS: 1) high-end computing infrastructures are needed to work with NGS data, 2) NGS data analysis involves complex multi-step processes, 3) these processes compare NGS data to multiple reference sequence databases, 4) the resulting datasets of alignments must be visualized in multiple ways, and 5) scientific knowledge is gained when many datasets are compared.

Next, choices for managing NGS data are compared in a four category table.  These include text and binary formats. While text formats (delimited and XML) have been popular for bioinformatics, they do not scale well and binary formats are gaining in popularity. The current bioinformatics binary formats are listed (bottom left) along with a description of their limitations. 

The introduction closes with a description of HDF5 and its advantages for supporting NGS data management and analysis. Specifically, HDF5 is a platform for managing scientific data. Such data are typically complex and consist of images, large multi-dimensional arrays, and metadata. HDF5 has been used for over 20 years in other data intensive fields; it is robust, portable, and tuned for high performance computing. Thus HDF5 is well suited for NGS. Indeed, groups ranging from academic researchers to NGS instrument vendors and software companies are recognizing the value of HDF5.

Section 2. This section illustrates how HDF5 facilitates primary data analysis. First we are reminded that NGS data are analyzed in three phases: primary analysis, secondary analysis, and tertiary analysis. Primary analysis is the step that converts images to reads consisting of basecalls (or colors, or flowgrams) and quality values. In secondary analysis, reads are aligned to reference data (mapped) or amongst themselves (assembled). In many NGS assays, secondary analysis produces tables of alignments that must be compared to one another, in tertiary analysis, to gain scientific insights.

The remaining portion of section 2 shows how Illumina GA and SOLiD primary data (reads and quality values) can be stored in BioHDF and later reviewed using the BioHDF tools and scripts.  The resulting quality graphs are organized into three groups (left to right) to show base composition plots, quality value (QV) distribution graphs, and other summaries.

Base composition plots show the count of each base (or color) that occurs at a given position in the read. These plots are used to assess overall randomness of a library and observe systematic nucleotide incorporation errors or biases.
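As a rough illustration of the counting behind a base composition plot, here is a minimal Python sketch. It is not the BioHDF tool itself; the function name `base_composition` and the toy reads are made up for the example, and it assumes reads are plain strings of base (or color) symbols.

```python
from collections import Counter


def base_composition(reads):
    """Count each symbol at each read position.

    Returns a list with one Counter per position. Reads of different
    lengths are allowed, so later positions may hold fewer counts.
    """
    counts = []
    for read in reads:
        for pos, base in enumerate(read):
            if pos == len(counts):
                counts.append(Counter())  # first read to reach this position
            counts[pos][base] += 1
    return counts


# Toy reads (hypothetical data, for illustration only)
reads = ["ACGT", "ACGA", "TCGT", "ACG"]
for pos, counter in enumerate(base_composition(reads), start=1):
    print(pos, dict(sorted(counter.items())))
```

Plotting the per-position counts for each symbol as separate lines gives the kind of graph described above; a strongly skewed position would stand out as a spike in one symbol.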

Quality value plots show the distribution of QVs at each base position within the ensemble of reads. As each NGS run produces many millions of reads, it is worthwhile summarizing QVs in multiple ways. The first plots, from the top, show the average QV per base with error bars indicating QVs that are within one standard deviation of the mean. Next, box and whisker plots show the overall quality distribution (median, lower and upper quartile, minimum and maximum values) at each position. These plots are followed by “error” plots which show the total count of QVs below certain thresholds (red, QV < 10; green QV < 20; blue, QV < 30). The final two sets of plots show the number of QVs at each position for all observed values and the number of bases having each quality value.
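The per-position summaries can be sketched in a few lines of Python. This is only an assumption about how such numbers might be computed, not the BioHDF implementation: `qv_summary` is a hypothetical name, reads are assumed to have equal length, and quartiles are left out to keep the sketch short.

```python
from statistics import mean, median, pstdev


def qv_summary(quality_rows, thresholds=(10, 20, 30)):
    """Summarize quality values (QVs) position by position.

    quality_rows: list of equal-length lists of integer QVs, one per read.
    Returns one dict per base position.
    """
    summaries = []
    for column in zip(*quality_rows):  # all QVs observed at one position
        summaries.append({
            "mean": mean(column),
            "sd": pstdev(column),
            "median": median(column),
            "min": min(column),
            "max": max(column),
            # Count of QVs below each threshold, as in the "error" plots
            "below": {t: sum(q < t for q in column) for t in thresholds},
        })
    return summaries


# Toy QVs for three reads of length three (hypothetical data)
rows = [[30, 28, 12], [32, 20, 8], [34, 24, 19]]
for pos, summary in enumerate(qv_summary(rows), start=1):
    print(pos, summary["mean"], summary["below"])
```

Note that the thresholds are strict (QV < 10, not QV <= 10), matching the way the plots are described.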

The final group of plots shows overall dataset complexity, GC content (base space only), average QV per read, and %GC vs average QV (base space only).  Dataset complexity is computed by determining the number of times a given read exactly matches other reads in the dataset. In some experiments, too many identical reads indicate a problem like PCR bias. In other cases, like tag profiling, many identical reads are expected from highly expressed genes. Errors in the data can artificially increase complexity.
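One simple way to express the exact-match idea is to count how often each distinct sequence occurs. The sketch below is an assumption about the general approach, with hypothetical function names; it is not taken from the BioHDF tools.

```python
from collections import Counter


def duplication_profile(reads):
    """Return {copies: number of distinct sequences seen that many times}."""
    per_sequence = Counter(reads)
    return Counter(per_sequence.values())


def fraction_unique(reads):
    """Fraction of reads whose sequence occurs exactly once."""
    per_sequence = Counter(reads)
    singletons = sum(1 for n in per_sequence.values() if n == 1)
    return singletons / len(reads) if reads else 0.0


# Toy dataset (hypothetical): one sequence seen 3 times, one twice, one once
reads = ["ACGT", "ACGT", "ACGT", "TTTT", "GGGA", "GGGA"]
print(dict(duplication_profile(reads)))
print(fraction_unique(reads))
```

Because a single miscalled base turns a duplicate into a new "unique" sequence, a profile like this also shows why errors in the data can make a dataset look more complex than it is.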

Section 3.  Primary data analysis gives us a picture of how well the samples were prepared or how well the instrument ran, with some indication of sample quality. Secondary and tertiary analyses tell us about sample quality and, more importantly, provide biological insights. The third section focuses on secondary and tertiary analysis and begins with a brief cartoon showing a high level data analysis workflow using BioHDF to store primary data, alignment results, and annotations. BioHDF tools are used to query these data, and other software within GeneSifter is used to compare data between samples and display the data in interactive reports to examine the details from single or multiple samples.

The left side of this section illustrates what is possible with single samples. Beginning with a simple table that indicates how many reads align to each reference sequence, we can drill into multiple reports that provide increasing detail about the alignments. For example, the gene list report (second from top) uses gene model annotations to summarize the alignments for all genes identified in the dataset. Each gene is displayed as a thumbnail graphic that can be clicked to see greater detail, which is shown in the third plot. The Integrated Gene View not only shows the density of reads across the gene's genomic region, but also shows evidence of splice junctions, and identified single base differences (SNVs) and small insertions and deletions (indels). Navigation controls provide ways to zoom into and out of the current view of data, and move to new locations. Additionally, when possible, the read density plot is accompanied by an Entrez gene model and dbSNP data so that data can be observed in a context of known information. Tables that describe the observed variants follow. Clicking on a variant drills into the alignment viewer to show the reads encompassing the point of variation.

The right side illustrates multi-sample analysis in GeneSifter. In assays like RNA-Seq, alignment tables are converted to gene expression values that can be compared between samples. Volcano (top) and other plots are used to visualize the differences between the datasets. Since each point in the volcano plot represents the difference in expression for a gene between two samples (or conditions), we can click on that point to view the expression details for that gene (middle) in the different samples. In the case of RNA-Seq, we can also obtain expression values for the individual exons within the gene, making it possible to observe differential exon levels in conjunction with overall gene expression levels (middle). Clicking the appropriate link in the exon expression bar graph takes us to the alignment details for the samples being analyzed (bottom); in this example we have two cases and two control replicates. Like the single sample Integrated Gene Views, annotations are displayed with alignment data. When navigation buttons are clicked, all of the displayed genes move together so that you can explore the gene's details and surrounding neighborhood for multiple samples in a comparative fashion.
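For readers unfamiliar with volcano plots, each point is conventionally placed at the log2 fold change (x) and the negative log10 of a p-value (y). The sketch below shows that convention only; how GeneSifter computes its expression values and statistics is not described here, and the function name, the pseudocount, and the input numbers are assumptions made for illustration.

```python
import math


def volcano_point(mean_control, mean_case, p_value, pseudocount=1.0):
    """Return (log2 fold change, -log10 p-value) for one gene.

    The pseudocount avoids division by zero for unexpressed genes;
    it is a common convention, assumed here for illustration.
    """
    x = math.log2((mean_case + pseudocount) / (mean_control + pseudocount))
    y = -math.log10(p_value)
    return x, y


# Hypothetical gene: mean expression 3 in controls, 15 in cases, p = 0.001
print(volcano_point(3, 15, 0.001))
```

Genes far from zero on the x axis and high on the y axis are the ones with large, statistically supported differences, which is why they are natural points to click on for detail.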

Section 4.  The poster closes with details about BioHDF.  First, the data model is described. An advantage of the BioHDF model is that read data are organized non-redundantly. Other formats, like BAM, tend to store reads with alignments, and if a read has multiple alignments in a genome, or is aligned to multiple reference sequences, it gets stored multiple times. This may seem trivial, but anything that can happen a million times becomes noticeable. This fact is demonstrated in the table listed in the second panel, “High Performance Computing Advantages.”  Other HDF5 advantages are listed below the performance stats table.  Most notable is HDF5’s ability to easily support multiple indexing schemes like nested containment lists (NClists). NClists solve the problem of efficiently accessing reads from alignments that may be contained in other alignments, which I will save for a later post.
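The storage difference can be illustrated with a toy model in Python. This is not the BioHDF schema; the `Alignment` class, its field names, and the example data are hypothetical and only show the idea of storing each read once and letting alignments refer to it by index.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Alignment:
    read_index: int  # position of the read in the shared read table
    reference: str
    start: int
    strand: str


# Each read is stored once (hypothetical data)
reads = ["ACGTACGT", "TTGACCAA"]

# The first read aligns to two references; the second read aligns once
alignments = [
    Alignment(0, "chr1", 100, "+"),
    Alignment(0, "chr7", 5400, "-"),
    Alignment(1, "chr2", 42, "+"),
]


def bases_if_stored_per_alignment(reads, alignments):
    """Bases written when every alignment carries its own copy of the read."""
    return sum(len(reads[a.read_index]) for a in alignments)


def bases_if_stored_once(reads):
    """Bases written when reads live in one table and alignments point to it."""
    return sum(len(r) for r in reads)


print(bases_if_stored_per_alignment(reads, alignments), bases_if_stored_once(reads))
```

With only three alignments the saving is small, but the ratio grows with the number of multiply aligned reads, which is the point made above about things that happen a million times.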

Finally, the poster is summarized with a number of take home points. These reiterate the fact that NGS is driving the need to use binary file formats to manage NGS data and analysis results, and that HDF5 provides an attractive solution because of its long history and development efforts that specifically target scientific programming requirements. In our hands, HDF5 has helped make GeneSifter a highly scalable and interactive web application with less development effort than would have been needed to implement other technologies.

If you are a software developer and are interested in BioHDF, please visit www.biohdf.org.  If you do not want to program and instead want a way to easily analyze your NGS data to make new discoveries, please contact us.


Sunday, March 14, 2010

Keeping Your DNA Sequencing, Genotyping, and Microarray Laboratory Competitive in a New Era of Genomics

ABRF 2010 is next week. The conference will be in sunny Sacramento CA. About 1000 technology geeks will convene to learn about the latest advances in DNA sequencing, genotyping, and proteomics instrumentation, lab protocols, and core lab services. We will be there with our booth and participate with LIMS and NGS data analysis presentations.

The first presentation, entitled "Keeping Your DNA Sequencing, Genotyping, and Microarray Laboratory Competitive in a New Era of Genomics," will be on Sunday Mar. 20 in the second concurrent workshop (w2) at 1:00 pm.

Abstract

Laboratory directors are facing enormous challenges with respect to keeping their laboratories competitive and retaining customers in the face of shrinking budgets and rapidly changing technology. A well-designed Laboratory Information Management System (LIMS) can help meet these challenges and manage costs as the scale and complexity of data collection and related services increase. LIMS can also offer competitive advantages through increased automation and improved customer experiences.

Implementing a LIMS strategy that will reduce data collection costs while improving competitiveness is a daunting proposition. LIMS are computerized data and information tracking systems that are highly variable with respect to their purpose, customization capabilities, and overall acquisition (initial purchase) and ownership (maintenance) costs. A simple LIMS can be built from a small number of spreadsheets and track a few specific processes. Sophisticated LIMS rely on databases to manage multiple laboratory processes, capture and analyze different kinds of data, and provide decision support capabilities.

In this presentation, I will share 20 years of academic and industrial LIMS experiences and perspectives that have been informed through hundreds of interactions with core, research, and manufacturing laboratories engaged in DNA sequencing, genotyping, and microarrays. We’ll explore the issues that need to be addressed with respect to either building a LIMS or acquiring a LIMS product. A new model that allows laboratories to offer competitive services, utilizing cost-effective laboratory automation strategies and new technologies like next generation sequencing, will be presented. We’ll also compare different IT infrastructures and discuss their advantages and how investments can be made to protect against unexpected costs as new instruments, like the HiSeq 2000(TM) or SOLiD 4(TM), third generation sequencing, or other genetic analysis platforms are introduced.


Tuesday, February 23, 2010

GeneSifter Lab Edition v3.14 - Release Notes

GeneSifter Laboratory Edition (GSLE) 3.14.0 introduces a host of new features and capabilities that make daily laboratory data management work even easier.  Read below to learn why GSLE is a leading LIMS product for all forms of DNA sequencing, microarrays, and other genetic analysis applications.

Orders and Invoices

Multi plate submissions: Order forms have been extended in several ways to further simplify how labs collect sample and project information. A new order form template lets core facilities, managing larger sequencing projects, easily receive samples and their information in a multiple plate format. New order fields specific to the plate format are included to support sample tracking and lab work.

Add data to fields: Order forms have been further improved by adding the ability to add new values (or terms) to dropdown fields that already exist on published order forms.


Project field: Additionally, labs can add an optional project field to forms. With these improvements, labs can create forms that are easier to use and modify, as well as enable project tracking for their customers.

Sample location and sample selection: Two new features deliver help for labs that provide sample storage (biobanking) services to their clients. First, order forms can include sample location information. This is particularly useful in situations where samples are delivered in 96-well plates that are stored for later use. Second, samples already stored by the lab as purified DNA, RNA or other material (templates) can be selected from specialized search interfaces within order forms. Like all GSLE sample entry forms, these features can be included or not on a case-by-case basis depending on your specific needs. 

Invoice formatting: For labs that have the dreaded chore of sending billing data to accounting departments we have added the ability to modify the invoice number format to include additional characters that are used to distinguish which labs are sending information.

Laboratory Operations


GSLE provides the ability to create, list and follow steps in sample protocols (also called workflows). In 3.14 new features not only expand the capabilities but make it possible to further standardize procedures. 


Multiplexing: In Next Generation Sequencing (NGS) several libraries are often combined into a single lane or region of a slide to increase the number of individual samples analyzed in a sequencing run. As each library is prepared, a specific adaptor sequence is added so sequence reads corresponding to different samples can be identified by their adaptor tag. This procedure, called multiplexing or barcoding, is supported in 3.14 and allows the lab to combine samples and adaptor sequences and group the combination of libraries together (Worksets) for sample processing and instrument runs. Once data are collected, sample naming conventions, combined with adaptor sequence (Multiplex Identifier, MID) stored in sample sheets, are used to separate individual reads into files corresponding to the samples that were in the original workset.
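The read separation step can be pictured with a small Python sketch. It is not GSLE code: it assumes the MID sits at the very start of each read, that only exact matches count, and that the sample sheet has already been reduced to a dictionary; the function name and the example tags are hypothetical.

```python
def demultiplex(reads, mid_to_sample, mid_length):
    """Group reads by the MID found at the start of each read.

    The MID is trimmed from reads that match a known sample. Reads with
    an unrecognized MID are kept untrimmed under the key None.
    """
    bins = {}
    for read in reads:
        mid, insert = read[:mid_length], read[mid_length:]
        sample = mid_to_sample.get(mid)  # None when the MID is unknown
        bins.setdefault(sample, []).append(insert if sample is not None else read)
    return bins


# Hypothetical sample sheet: MID sequence -> sample name
sample_sheet = {"ACGT": "liver", "TGCA": "kidney"}
reads = ["ACGTGGGCCC", "TGCAAAATTT", "ACGTTTTAAA", "NNNNACGTAC"]

for sample, sample_reads in demultiplex(reads, sample_sheet, 4).items():
    print(sample, sample_reads)
```

A production tool would usually also tolerate a mismatch or two in the tag; exact matching is used here only to keep the example short.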

Batch data entry: Some lab processes require that samples are manipulated in groups (batches), but laboratory data are collected for individual samples within the batch. For example, the concentrations of individual DNA samples may need to be measured in a 96-well plate. To improve how the OD values, comments, or other information are entered, workflow steps have been updated to include batch data entry forms that provide spreadsheet-like data entry capabilities. Like all GSLE batch data entry forms, data can be entered easily using the form’s column highlight and easy fill controls, or uploaded from an Excel spreadsheet.

Subsample processing: GSLE 3.14 also increases sample processing flexibility. As noted above, order forms can now support the ability to select samples that are already stored in the system. This feature is further extended into the laboratory by creating tools that allow many new samples to be created from a “parent” or stock sample. When new samples (templates) are created, options are provided so that each new sample can be entered into a different process. For example, suppose you receive a tissue sample that needs several experiments performed: RNA-Seq, ChIP-Seq, and resequencing. Now you can easily pick the sample and create three new subsamples, defining which process will be performed on each one, with just a few clicks.

Selecting samples based on custom data: Some labs need to use custom data entered into order forms to sort and filter samples in the lab. For example, an order form may ask a researcher to enter read lengths for their NGS run. A 36 base run is much faster than a 100 base run, and on some platforms costs less. Thus, the lab will sort samples based on read length prior to the data collection event. While it has always been possible to get this information in many GSLE displays, 3.14 adds new capabilities to use any custom data in its specialized sample picker tools.

Other Features

Customer data management: GSLE v3.14 gives labs’ customers increased ability to organize their chromatograms, fragment analysis files and microarray files as needed. Data files can be edited, relabeled, moved or deleted. Projects and folders can be created, modified or deleted to aid in data organization.

Application Programming Interface (Onsite Installations Only)

SQL-API: As automation and system integration needs increase, requirements for supporting programmatic data entry become more important. GSLE has continued to expand the self-documenting Application Programming Interface (API). We have also added an SQL API that can be used to create custom reports that are accessed via a wget-style Unix command.


Input API enhancements: The Input API now returns success IDs and CGI parameter names have been eliminated. The full documentation can be reviewed by contacting support@geospiza.com for the GSLE SQL API Manual or the GSLE Input API Manual. 


Next Generation Analysis Transfer Tool (Hosted Partners Only)

Simplified data transfers: A data transfer interface has been added to connect GSLE and GeneSifter Analysis Edition (GSAE). Partner Program administrators use the interface to select data files in GSLE and transfer them to their customer’s account in GSAE.

Schema Table update note


There was an update to an existing schema table: the column "Plate_Label" is now in table om_sample_plate instead of om_order.


Thursday, December 31, 2009

2009 Review

The end of the year is a good time to reflect, review accomplishments, and think about the year to come. 2009 was a good year for Geospiza’s customers, with many exciting accomplishments for the company. Highlights are reviewed below.

Two products form a complete genetic analysis system


Geospiza’s two core products, GeneSifter Laboratory Edition (GSLE) and GeneSifter Analysis Edition (GSAE), help laboratories do their work and scientists analyze their data. GSLE is the LIMS (Laboratory Information Management System) that laboratories, from service labs to high-throughput data production centers, use to collect information about samples, track and manage laboratory procedures, organize and process data, and deliver data and results back to researchers. GSLE supports traditional DNA sequencing (Sanger), fragment analysis, genotyping, microarrays, Next Generation Sequencing (NGS) and other technologies.

In 2008, Geospiza released the third version of the platform (back then it was known as FinchLab). This version launched a new way of providing LIMS solutions. Traditional LIMS systems require extensive programming and customization to meet a laboratory’s specific requirements. They include a very general framework designed to support a wide range of activities. Their advantage is that they are highly customizable. However, this advantage comes at the expense of very high acquisition costs accompanied by lengthy requirements planning and programming before they become operational.

In contrast, GSLE contains default settings that support genetic analysis out-of-the-box, while allowing laboratories to customize operations without programmer support. Default settings in GSLE support DNA sequencing, microarray, and genotyping services. The GSLE abstraction layer supports extensive configuration to meet specific needs as they arise. Through this design, the costs of acquiring and operating a high-quality advanced LIMS system are significantly reduced.

Throughout 2009, hundreds of features were added to GSLE to increase support for instruments and data types, and improve how laboratory procedures (workflows) are created, managed, and shared. Enhancements were made to features like experiment ordering, organization, and billing. We also added new application programming interfaces (APIs) to enable integration with enterprise software. Specific highlights included:
  • Extending microarray support to include sample sheet generation and automate uploading files
  • Improving NGS file and data browsing to simplify the process of searching and viewing the thousands of files produced in Next Gen sequencing runs
  • Making NGS data downloads, of very large gigabase files, robust and easy
  • Adding worksets to group DNA and RNA samples in customized ways that facilitate laboratory processing
  • Creating APIs to utilize external password servers and programmatically receive data using GSLE form objects
  • Enhancing ways for groups to add HTML to pages to customize their look and feel
In addition to the above features, we’ve also completed development on methods to multiplex NGS samples and track MIDs (molecular identifiers and molecular barcodes), enter laboratory data like OD values and bead counts in batches, create orders with multiple plates, and access SQL queries through an API. Look for these great features and more in the early part of 2010.

GSAE

As noted, GSAE is Geospiza’s data analysis product. While GSLE is capable of running advanced data analysis pipelines, the primary focus of data analysis in GSLE is to provide quality control. Thus its data analyses and presentation focus on single samples. GSAE provides the infrastructure and tools to compare the results between samples. In the case of NGS, GSAE also provides more reports and data interactions. GSAE began as a web-based microarray data analysis platform, making it well suited for NGS-based gene expression assays. Over 2009 many new features were added to extend its utility to NGS data analysis with a focus on whole transcriptome analysis. Highlights included:
  • Developing data analysis pipelines for RNA-Seq, Small RNA, ChIP-Seq, and other kinds of NGS assays
  • Adding tools to visualize and discover alternatively spliced transcripts in gene expression assays
  • Extending expression analysis tools to include interactive volcano plots and unbalanced two-way ANOVA computations
  • Increasing NGS transcriptome analysis capabilities to include variation detection and visualization
The above features fulfill the requirements needed to make a platform complete for both NGS and microarray-based gene expression analysis. And, the addition of variation detection and visualization lays the groundwork for GSAE to extend its market leadership to resequencing data analysis.

Geospiza Research

In 2009 Geospiza won two research awards in the form of Phase II STTR and Phase I SBIR grants. The STTR project is researching new ways to organize, compress, and access NGS data by adapting HDF technologies to bioinformatics. Through this work we are developing a robust data management infrastructure that supports our NGS sequencing analysis pipelines and interactive user interfaces. The second award targets NGS-based variation detection. This work began in the last quarter of the year, but is already delivering new ways to identify and visualize variants in RNA-Seq and whole transcriptome analysis.

To learn more about our progress in 2009, visit our news page. It includes our press releases and reports in the news, publications citing our software, and webinars where we have presented our latest and greatest.

As we close 2009, we especially want to thank our customers and collaborators for their support in making the year successful and we look forward to an exciting year ahead in 2010.


Sunday, November 1, 2009

GeneSifter Laboratory Edition Update

GeneSifter Laboratory Edition has been updated to version 3.13. This release has many new features and improvements that further enhance its ability to support all forms of DNA sequencing and microarray sample processing and data collection.

Geospiza Products

Geospiza's two primary products, GeneSifter Laboratory Edition (GSLE) and GeneSifter Analysis Edition (GSAE), form a complete software system that supports many kinds of genomics and genetic analysis applications. GSLE is the LIMS (Laboratory Information Management System) that is used by core labs and service companies worldwide that offer DNA sequencing (Sanger and Next Generation), microarray analysis, fragment analysis and other forms of genotyping. GSAE is the analysis system researchers use to analyze their data and make discoveries. Both products are actively updated to keep current with latest science and technological advances.

The new release of GSLE helps labs share workflows, perform barcode-based searching, view new data reports, simplify invoicing, and automate data entry through a new API (application programming interface).

Sharing Workflows

GSLE laboratory workflows make it possible for labs to define and track their protocols and data that are collected when samples are processed. Each step in a protocol can be configured to collect any kind of data, like OD values, bead counts, gel images and comments, that are used to record sample quality. In earlier versions, protocols could be downloaded as PDF files that list the steps and their data. With 3.13, a complete workflow (steps, rules, custom data) can be downloaded as an XML file that can be uploaded into another GSLE system to recreate the entire protocol with just a few clicks. This feature simplifies protocol sharing and makes it possible for labs to test procedures in one system and add them to another when they are ready for production.

Barcode Searching and Sample Organization

Sometimes a lab needs to organize separate tubes in 96-well racks for sample preparation. Assigning each tube's rack location can be an arduous process. However, if the tubes are labeled with barcode identifiers, a bed scanner can be used to make the assignments. GSLE 3.13 provides an interface to upload bed scanner data and assign tube locations in a single step. Also, new search capabilities have been added to find orders in the system using sample or primer identifiers. For example, orders can be retrieved by scanning a barcode from a tube in the search interface.


Reports and Data

Throughout GSLE, many details about data can be reviewed using predefined reports. In some cases, pages can be quite long, but only a portion of the report is interesting. GSLE now lets you collapse sections of report pages to focus on specific details. New download features have also been added to better support access to those very large NGS data files.

GSLE has always been good at identifying duplicate data in the system, but not always as good at letting you decide how duplicate data are managed. Managing duplicate data is now more flexible to better support situations where data need to be reanalyzed and reloaded.

The GSLE data model makes it possible to query the database using SQL. In 3.13, the view tables interface has been expanded so that the data stored in each table can be reviewed with a single click.

Invoices

Core labs that send invoices will benefit from changes that make it possible to download many PDF formatted orders and invoices into a single zipped folder. Configurable automation capabilities have also been added to set invoice due dates and generate multiple invoices from a set of completed orders.

API Tools

As automation and system integration needs increase, external programs are used to enter data from other systems. GSLE 3.13 supports automated data entry through a novel self-documenting API. The API takes advantage of GSLE's built-in data validation features that are used by the system's web-based forms. At each site, the API can be turned on and off by on-site administrators and its access can be limited to specific users. This way, all system transactions are easily tracked using existing GSLE logging capabilities. In addition to data validation and access control, the API is self-documenting. Each API-containing form has a header that includes key codes, example documentation, and features to view and manually upload formatted data to test automation programs and help system integrators get their work done. GSLE 3.13 further supports enterprise environments with an improved API that is used to query external password authentication servers.


Tuesday, May 12, 2009

Small RNAs Get Smaller

Tiny RNAs recently joined the growing list of non-coding RNA (ncRNA) molecules [1]. Their exact function is not understood, but they are possibly a new class of ncRNA and appear to be most associated with transcription of highly expressed genes in humans, chickens, and Drosophila, and possibly other species.

This was the conclusion of work published in the May issue of Nature Genetics. Remember when all we had to worry about was the central dogma? DNA was transcribed into RNA and RNA was translated into protein. Life was so simple.

Not really. Even as the first genetic code was being elucidated [2], the possibility of uncovering the second code, which translates nucleic acid sequences into protein sequences, was being contemplated [3]. Translating RNA into protein required other kinds of RNA that became known as ribosomal (rRNA) and transfer RNA (tRNA). The RNA between DNA and protein became messenger (m)RNA. In the late seventies, introns were discovered [4,5] and soon to follow were small nuclear (sn)RNAs and “snurps” (small nuclear ribonucleoproteins). The snRNAs were further characterized as small nucleolar (snoRNA) and Cajal body-specific (scaRNA) RNAs, and a class of new molecules was investigated for its involvement in mRNA splicing.

As the mechanisms for splicing were being worked out, researchers were able to prove that RNA could also be an enzyme [6]. In this case, the intron is the enzyme responsible for splicing itself out to create the mature mRNA. At the same time, another group discovered that the catalytic unit of RNase P, an enzyme involved in converting precursor tRNAs into active tRNAs, is also RNA [7]. Indeed, later work revealed that rRNA in the large ribosome subunit catalyzes the peptidyl transferase reaction to join amino acids together to build proteins [8]. Not only does the central dogma require a multitude of RNA molecules to transcribe DNA into RNA and translate RNA into protein, but the RNA molecules are responsible for carrying the information needed to make proteins and supplying the enzymatic activity to do the work!

What else does RNA do?

More than we can imagine. Starting with the discovery that double stranded RNA (dsRNA) could inhibit gene expression by turning on RNA interference (RNAi) pathways [9], new RNAs were identified, micro (miRNA) and small interfering (siRNA), as essential to the RNAi pathway. miRNA and siRNA were the early members of what would become a large and growing class of RNAs now referred to as non-coding RNAs (ncRNAs).

The ncRNAs represent a next frontier in RNA research and understanding gene expression. Some ncRNAs are large, like lincRNAs (large intervening non-coding RNAs) [10], but most are small, between 18 and 31 nt. Within the small ncRNA group are piwi-interacting (piRNA), repeat associated small interfering (rasiRNA), small temporal (stRNA), and now transcription initiation (tiRNA) RNA. I like tiny RNA.

Tiny RNAs, or tiRNAs, were discovered by Next Generation Sequencing (NGS) studies. RNA libraries were prepared from specific size fractions of capped messages. The resulting libraries were sequenced on the Roche FLX Genome Sequencing system and the data were aligned to human genome build 36.1 and compared to transcription start sites (TSS) defined by RefGene (NCBI). The authors reasoned that previous deep-sequencing studies missed these RNA molecules because they tend to be disregarded as low-abundance, spurious, or degradation products. However, because they can be cloned, they must have a 5’ phosphate and, when aligned to genomic sequences, the NGS reads cluster in a non-random fashion around TSSs.

GeneSifter enables small RNA research

NGS makes it possible to explore the RNA world in new ways by designing experiments to capture small RNA molecules and sequence them in a massively parallel, high throughput format. However, both the experiments and data analysis are technically challenging. Fortunately GeneSifter Laboratory Edition (GSLE) and GeneSifter Analysis Edition (GSAE) can help. In GSLE you can use the software to track RNA preparation steps and record data at different points of the process. GSAE comes with data analysis pipelines designed to filter artifacts and identify known small RNAs. Post alignment clustering reports, based on coverage in a genome, can be used to further refine results and discover new RNA species as well. Moreover, you can convert the clustering reports into lists of expression values for these RNAs and compare their expression between different samples, tissues, or experimental conditions.


References
1. Taft R.J., Glazov E.A., Cloonan N., et al., 2009. Tiny RNAs associated with transcription start sites in animals. Nat Genet 41, 572-578.

2. Watson J.D., Crick F.H.C., 1953. Molecular structure of nucleic acids: a structure for deoxyribose nucleic acid. Nature 171, 737-738.

3. Crick F.H., Barnett L., Brenner S., Watts-Tobin R.J., 1961. General nature of the genetic code for proteins. Nature 192, 1227-1232.

4. Chow L.T., Roberts J.M., Lewis J.B., Broker T.R., 1977. A map of cytoplasmic RNA transcripts from lytic adenovirus type 2, determined by electron microscopy of RNA:DNA hybrids. Cell 11, 819-836.

5. Berk A.J., Sharp P.A., 1977. Sizing and mapping of early adenovirus mRNAs by gel electrophoresis of S1 endonuclease-digested hybrids. Cell 12, 721-732.

6. Zaug A.J., Cech T.R., 1982. The intervening sequence excised from the ribosomal RNA precursor of Tetrahymena contains a 5-terminal guanosine residue not encoded by the DNA. Nucleic Acids Res 10, 2823-2838.

7. Guerrier-Takada C., Gardiner K., Marsh T., Pace N., Altman S., 1983. The RNA moiety of ribonuclease P is the catalytic subunit of the enzyme. Cell 35, 849-857.

8. Nissen P., Hansen J., Ban N., Moore P.B., Steitz T.A., 2000. The structural basis of ribosome activity in peptide bond synthesis. Science 289, 920-930.

9. Fire A., Xu S., Montgomery M.K., Kostas S.A., Driver S.E., Mello C.C., 1998. Potent and specific genetic interference by double-stranded RNA in Caenorhabditis elegans. Nature 391, 806-811.

10. Guttman M., Amit I., Garber M., French C., Lin M.F., Feldser D., Huarte M., Zuk O., Carey B.W., Cassady J.P., Cabili M.N., Jaenisch R., Mikkelsen T.S., Jacks T., Hacohen N., Bernstein B.E., Kellis M., Regev A., Rinn J.L., Lander E.S., 2009. Chromatin signature reveals over a thousand highly conserved large non-coding RNAs in mammals. Nature 458, 223-227.


Further Reading
ncRNA - http://nar.oxfordjournals.org/cgi/reprint/35/suppl_1/D178
snoRNA - http://en.wikipedia.org/wiki/SnoRNA
siRNA - http://en.wikipedia.org/wiki/SiRNA
miRNA - http://en.wikipedia.org/wiki/MicroRNA
piRNA - http://en.wikipedia.org/wiki/Piwi-interacting_RNA
rasiRNA - http://en.wikipedia.org/wiki/RasiRNA
stRNA - http://jcs.biologists.org/cgi/content/full/116/23/4689
Ribozymes - http://en.wikipedia.org/wiki/Ribozyme


Databases
miRBase - http://microrna.sanger.ac.uk/sequences/
RNAdb - http://research.imb.uq.edu.au/rnadb/default.aspx

Tuesday, April 28, 2009

Life in the Clouds

Today, the Applied Biosystems division of Life Technologies announced their partnership with us (Geospiza) to use cloud-computing technologies on Amazon Web Services (AWS) to help customers manage data from advanced genomic analysis platforms.

This news is significant for several reasons.

First, as noted by our President, Rob Arnold, in Xconomy, this is the first time a leading gene sequencing company has agreed to offer the sequencing instrument, the consumable chemicals needed to run experiments, and the software needed to sort through and make sense of the data, in a single package and run the software under a SaaS model.

Second, through this news, and other activities, we are proactively addressing one of the major challenges for Next Generation Sequencing. Specifically, the costs and time for purchasing, deploying, and maintaining IT systems to support NGS data management and analysis are simply out of reach for the vast majority of research groups that can benefit from these technologies. Presently, the groups making the greatest progress with NGS have advanced bioinformatics and IT support. However, if these technologies are going to truly meet their promise of revolutionizing genomics, the number of scientists utilizing NGS needs to increase. All over the country (and world) medical researchers, microbiologists, plant biologists, and other scientists have interesting samples and new ideas for which NGS experiments will provide amazing discoveries, but they can only follow through on those ideas if they can work with their data in a reasonable, cost-effective way.

Finally, today's news is gratifying because it validates one of Geospiza's important early technology decisions. Cloud computing, also called "Software as a Service" (SaaS), is not new to Geospiza. When we started in 1997, we made an important decision to develop our platform as a web-based system. We've been working with web technology and Internet-based services right from the beginning. Back then we used the term ASP (Application Service Provider) instead of SaaS, but we have had clients effectively using our systems this way for many years. Our long experience with cloud computing has prepared us to meet the new challenges created by NGS, and we look forward to working with Applied Biosystems on the AWS platform to extend the number of options we can provide to our customers, help lower their computing costs, and enable their science.

For more information visit:

Geospiza Cloud
Press Release

FinchTalks where SaaS is discussed:

Have we been deluged?
Three Themes from AGBT and ABRF Part III: The IT Problem
Next Gen-Omics

Focus on Next Gen Sequencing
Closing 2008

Sunday, March 8, 2009

Bloginar: Next Gen Laboratory Systems for Core Facilities

Geospiza kicked off February by attending the AGBT and ABRF conferences. As part of our participation at ABRF, we presented a scenario, in our poster, where a core lab provides Next Generation Sequencing (NGS) transcriptome analysis services. This story shows how GeneSifter Lab and Analysis Edition’s capabilities overcome the challenges of implementing NGS in a core lab environment.

Like the last post, which covered our AGBT poster, the following poster map will guide the discussion.


As this poster overlaps the previous poster in terms of providing information about RNA assays and analyzing the data, our main points below will focus on how GeneSifter Lab Edition solves challenges related to laboratory and business processes associated with setting up a new lab for NGS or bringing NGS into an existing microarray or Sanger sequencing lab.

Section 1 contains the abstract, an introduction to the core laboratory, and background information on different kinds of transcription profiling experiments.

The general challenge for a core lab lies in the need to run a business that offers a wide variety of scientific services for which samples (physical materials) are converted to data and information that have biological meaning. Different services often require different lab processes to produce different kinds of data. To facilitate and direct lab work, each service requires specialized information and instructions for samples that will be processed. Before work is started, the lab must review the samples and verify that the information has been correctly delivered. Samples are then routed through different procedures to prepare them for data collection. In the last steps, data are collected, reviewed, and the results are delivered back to clients. At the end of the day (typically monthly), orders are reviewed and invoices are prepared either directly or by updating accounting systems.

In the case of NGS, we are learning that the entire data collection and delivery process gets more complicated. When compared to Sanger sequencing, genotyping, or other assays that are run in 96-well formats, sample preparation is more complex. NGS requires that DNA libraries be prepared, and different steps of the process need to be measured and tracked in detail. Also, complicated bioinformatics workflows are needed to understand the data in terms of both quality control and biological meaning. Moreover, NGS requires a substantial investment in information technology.

Section 2 walks through the ways in which GeneSifter Lab Edition helps to simplify the NGS laboratory operation.

Order Forms

In the first step, an order is placed. Screenshots show how GeneSifter can be configured for different services. Labs can define specialized form fields using a variety of user interface elements like check boxes, radio buttons, pull down menus, and text entry fields. Fields can be required or optional, and special rules such as ranges for values can be applied to individual fields within specific forms. Orders can also be configured to take files as attachments to track data, like gel images, about samples. To handle that special “for lab use only” information, fields in forms can be specified as laboratory use only. Such fields are hidden from the customer’s view and are filled in later by lab personnel when the orders are processed. The advantage of GeneSifter’s order system is that the pertinent information is captured electronically in the same system that will be used to track sample processing and organize data. Indecipherable paper forms are eliminated along with the problem of finding information scattered on multiple computers.

Web forms do create a special kind of data entry challenge. Specifically, when there is a lot of information to enter for a lot of samples, filling in numerous form fields on a web page can be a serious pain. GeneSifter solves this problem in two ways:

First, all forms can have “Easy Fill” controls that provide column highlighting (for fast tab-and-type data entry), auto fill downs, and auto fill downs with number increments so one can easily “copy” common items into all cells of a column, or increment an ending number across all values in a column. When these controls are combined with the “Range Selector,” a powerful web-based user interface makes it easy to enter large numbers of values quickly in flexible ways.

Second, sometimes the data to be entered are already in an Excel spreadsheet. To handle this case, each form contains a specialized Excel spreadsheet validator. The form can be downloaded as an Excel template, and the rules previously assigned to fields when the form was created are used to check data when they are uploaded. This process spots problems with data items and reports them at upload time when they are easy to fix, rather than later when information is harder to find. This feature eliminates endless cycles of contacting clients to get the correct information.
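
To make the idea of upload-time validation concrete, here is a minimal sketch in Python. It is not GeneSifter's implementation; the field names, the rule keys ("required", "min", "max"), and the validate_rows helper are all hypothetical, and the rows stand in for values already read from a spreadsheet.

```python
# Hypothetical sketch: check uploaded rows against per-field rules.
# Field names and rule keys are illustrative, not GeneSifter's schema.

RULES = {
    "sample_name": {"required": True},
    "concentration_ng_ul": {"required": True, "min": 1.0, "max": 5000.0},
    "notes": {"required": False},
}


def validate_rows(rows, rules):
    """Return a list of (row_number, field, message) for every problem found."""
    problems = []
    for row_number, row in enumerate(rows, start=1):
        for field, rule in rules.items():
            value = row.get(field, "")
            text = str(value).strip() if value is not None else ""
            if text == "":
                # Empty cells are only a problem for required fields.
                if rule.get("required"):
                    problems.append((row_number, field, "missing required value"))
                continue
            if "min" in rule or "max" in rule:
                # Range rules imply the value must be numeric.
                try:
                    number = float(text)
                except ValueError:
                    problems.append((row_number, field, "not a number"))
                    continue
                if "min" in rule and number < rule["min"]:
                    problems.append((row_number, field, "below allowed range"))
                if "max" in rule and number > rule["max"]:
                    problems.append((row_number, field, "above allowed range"))
    return problems


# Made-up rows, as they might look after reading a filled-in template.
rows = [
    {"sample_name": "S1", "concentration_ng_ul": "250", "notes": ""},
    {"sample_name": "", "concentration_ng_ul": "abc", "notes": "rerun"},
    {"sample_name": "S3", "concentration_ng_ul": "0.2"},
]
problems = validate_rows(rows, RULES)
for problem in problems:
    print(problem)
```

Reporting every problem in one pass, rather than stopping at the first, is what lets a client fix the whole spreadsheet in a single round trip.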

Laboratory Processing

Once order data are entered, the next step is to process orders. The middle of section 2 describes this process using an RNA-Seq assay as an example. Like other NGS assays, the RNA-Seq protocol has many steps involving RNA purification, fragmentation, random primed conversion into cDNA, and DNA library preparation of the resulting cDNA for sequencing. During the process, the lab needs to collect data on RNA and DNA concentration as well as determine the integrity of the molecules throughout the process. If a lab runs different kinds of assays they will have to manage multiple procedures that may have different requirements for ordering of steps and laboratory data that need to be collected.

By now it is probably not a surprise to learn that GeneSifter Lab Edition has a way to meet this challenge too. To start, workflows (lab procedures) can be created for any kind of process with any number of steps. The lab defines the number of steps and their order and which steps are required (like the order forms). Having the ability to mix required and optional steps in a workflow gives a lab the ultimate flexibility to support those “we always do it this way, except the times we don’t” situations. For each step the lab can also define whether or not any additional data needs to be collected along the way. Numbers, text, and attachments are all supported so you can have your Nanodrop and Bioanalyzer too.

Next, an important feature of GeneSifter workflows is that a sample can move from one workflow to another. This modular approach means that separate workflows can be created for RNA preparation, cDNA conversion, and sequencing library preparation. If a lab has multiple NGS platforms, or a combination of NGS and microarrays, they might find that a common RNA preparation procedure is used, but the processes diverge when the RNA is converted into forms for collecting data. For example, aliquots of the same RNA preparation may be assayed and compared on multiple platforms. In this case a common RNA preparation protocol is followed, but sub-samples are taken through different procedures, like a microarray and NGS assay, and their relationship to the “parent” sample must be tracked. This kind of scenario is easy to set up and execute in GeneSifter Lab Edition.
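
One way to picture modular workflows and parent-child sample tracking is the small Python sketch below. The class names (Step, Workflow, Sample), the step names, and the sample identifiers are hypothetical and only illustrate the concepts described above; they are not GeneSifter's data model.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Step:
    name: str
    required: bool = True  # optional steps may be skipped


@dataclass
class Workflow:
    name: str
    steps: List[Step]

    def missing_required(self, completed):
        """Names of required steps that are not yet in `completed`."""
        return [s.name for s in self.steps if s.required and s.name not in completed]


@dataclass
class Sample:
    name: str
    parent: Optional["Sample"] = None
    completed: set = field(default_factory=set)

    def aliquot(self, name):
        """Create a sub-sample that remembers its parent."""
        return Sample(name=name, parent=self)

    def lineage(self):
        """Sample names from this sample back to the original parent."""
        node, names = self, []
        while node is not None:
            names.append(node.name)
            node = node.parent
        return names


# A shared RNA preparation workflow with one optional step.
rna_prep = Workflow(
    "RNA preparation",
    [Step("Purify RNA"), Step("Bioanalyzer check", required=False), Step("Quantify RNA")],
)

rna = Sample("RNA-001")
rna.completed.add("Purify RNA")
print(rna_prep.missing_required(rna.completed))

# Aliquots of the same preparation go to different downstream workflows.
ngs_aliquot = rna.aliquot("RNA-001-NGS")
array_aliquot = rna.aliquot("RNA-001-ARRAY")
print(ngs_aliquot.lineage())
```

Keeping the parent link on the sample, instead of inside any one workflow, is what allows sub-samples to move through separate procedures while staying traceable to the common preparation.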

Finally, one of GeneSifter’s greatest advantages is that a customized system with all of the forms, fields, Excel import features, and modular workflows can be added by lab operators without any programming. Achieving similar levels of customization with traditional LIMS products takes months or years, with initial and recurring costs of six or more figures.

Collecting Data

The last step of the process is collecting the data, reviewing it, and making sequences and results available to clients. Multiple screenshots illustrate how this works in GeneSifter Lab Edition. For each kind of data collection platform, a “run” object is created. The run holds the information about reactions (the samples ready to run) and where they will be placed in the container that will be loaded into the data collection instrument. In this context, the container is used to describe 96 or 384-well plates, glass slides with divided areas called lanes, regions, chambers, or microarray chips. All of these formats are supported and in some cases specialized files (sample sheets, plate records) are created and loaded into instrument collection software to inform the instrument about sample placement and run conditions for individual samples.
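
As a rough illustration of what a run holds, the Python sketch below assigns reactions to positions in a 96-well plate. The helper names, the reaction identifiers, and the row-by-row fill order are assumptions made for the example; real instruments and sample sheets may expect a different order or format.

```python
import string


def plate_wells(rows=8, columns=12):
    """Well names for a plate, filled row by row: A1, A2, ... H12 for 96 wells."""
    return [
        f"{string.ascii_uppercase[r]}{c}"
        for r in range(rows)
        for c in range(1, columns + 1)
    ]


def place_reactions(reactions, wells):
    """Map each reaction to the next free well, in order."""
    if len(reactions) > len(wells):
        raise ValueError("more reactions than wells")
    return dict(zip(wells, reactions))


wells = plate_wells()
layout = place_reactions(["rxn-1", "rxn-2", "rxn-3"], wells)
print(layout)
```

Calling plate_wells(16, 24) would give the 384-well equivalent; lanes or regions on a slide could be modeled the same way with a different list of position names.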

During the run, samples are converted to data. This process, different for each kind of data collection platform, produces variable numbers and kinds of files that are organized in completely different ways. Using tools that work with GeneSifter, raw data and tracking information are entered into the database to simplify access to the data at a later time. The database also associates sample names and other information with data files, eliminating the need to rename files with complex tracking schemes. The last steps of the process involve reviewing quality information and deciding whether to release data to clients or repeat certain steps of the process. When data are released, each client receives an email directing them to their data.

The lab updates the orders and optionally creates invoices for services. GeneSifter Lab Edition can be used to manage those business functions as well. We’ll cover GeneSifter’s pricing and invoicing tools at some other time, but be assured they are as complete as the other parts of the system.

NGS requires more than simple data delivery

Section 3 covers issues related to the computational infrastructure needed to support NGS and the data analysis aspects of the NGS workflow. In this scenario, our core lab also provides data analysis services to convert those multi-million read files into something that can be used to study biology. Much of this is covered in the previous post, so it will not be repeated here.

I will summarize by making the final points that Geospiza’s GeneSifter products cover all aspects of setting up a lab for NGS. From sample preparation, to collecting data, to storing and distributing results, to running complex bioinformatics workflows and presenting information in ways to get scientifically meaningful results, a comprehensive solution is offered. GeneSifter products can be delivered as hosted solutions to lower costs. Our hosted, Software as a Service, solutions allow groups to start inexpensively and manage costs as the needs scale. More importantly, unlike in-house IT systems, which require significant planning and implementation time to remodel (or build) server rooms and install computers, GeneSifter products get you started as soon as you decide to sign up.

Wednesday, March 4, 2009

Bloginar: The Next Generation Dilemma: Large Scale Data Analysis

Previous posts shared some of the things we learned at the AGBT and ABRF meetings in early February. Now it is time to share the work we presented, starting with the AGBT poster, “The Next Generation Dilemma: Large Scale Data Analysis.”

The goal of the poster was to provide a general introduction to the power of Next Generation Sequencing (NGS) and a framework for data analysis. Hence, the abstract described the general NGS data analysis process, its issues, and what we are doing for one kind of transcription profiling, RNA-Seq. Between then and now we learned a few things... And the project grew.

The map below guides my “bloginar” poster presentation. In keeping with the general theme of the abstract we focused on transcription analysis, but instead of focusing exclusively on RNA-Seq, the project expanded to compare three kinds of transcription profiling: RNA-Seq, Tag Profiling, and Small RNA Analysis. A link to the poster is provided at the end.

Section 1 provides a general introduction to NGS by discussing the ways NGS is being used to study different aspects of molecular biology. It also covers how the data are analyzed in three phases (primary, secondary, tertiary) to convert raw data into biologically meaningful information. The three phase model has emerged as a common framework to describe the process of converting image data into primary sequence data (reads) and then turning the reads into information that can be used in comparative analyses. Secondary analysis is the phase where reads are aligned to reference sequences to get gene names, position, and (or) frequency information that can be used to measure changes, like gene expression, between samples.
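
The frequency part of secondary analysis can be sketched in a few lines of Python. The alignment records, gene names, and the per-million scaling below are assumptions chosen for illustration; the poster does not specify how hits are normalized, and a real pipeline would read these records from an aligner's output.

```python
from collections import Counter

# Hypothetical alignment records: (read_id, reference_name).
alignments = [
    ("read1", "GeneA"),
    ("read2", "GeneA"),
    ("read3", "GeneB"),
    ("read4", "GeneA"),
]


def count_hits(alignments):
    """Count how many reads aligned to each reference sequence."""
    return Counter(reference for _read_id, reference in alignments)


def per_million(counts):
    """Scale raw counts to hits per million aligned reads."""
    total = sum(counts.values())
    return {name: n * 1_000_000 / total for name, n in counts.items()}


counts = count_hits(alignments)
normalized = per_million(counts)
print(counts)
print(normalized)
```

Scaling by the total number of aligned reads is one simple way to make counts from runs of different depth comparable before tertiary analysis.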

The remaining sections of the poster use examples from transcription analysis to illustrate and address the multiple challenges (listed below) that must be overcome to efficiently use NGS.
  • High end infrastructures are needed to manage and work with extremely large data sets
  • Complex, multistep analysis procedures are required to produce meaningful information
  • Multiple reference data are needed to annotate and verify data and sample quality
  • Datasets must be visualized in multiple ways
  • Numerous Internet resources must be used to fill in additional details
  • Multiple datasets must be comparatively analyzed to gain knowledge

Section 2 describes the three different kinds of transcription profiling experiments. This section provides additional background on the methods and what they measure. For example, RNA-Seq and Tag Profiling are commonly used to measure gene expression. In RNA-Seq, DNA libraries are prepared by randomly amplifying short regions of DNA from cDNA. The sequences that are produced will generally cover the entire region of the transcripts that were originally isolated. Hence, it is possible to get information about alternative splicing and biased allelic expression. In contrast, Tag Profiling focuses on creating DNA libraries from discrete points within the RNA molecules. With Tag Profiling, one can quickly measure relative gene expression, but cannot get information about alternative splicing and allelic expression. The table in section 2 discusses these and other issues one must consider when running the different assays.

Sections 3, 4, and 5 outline three transcriptome scenarios (RNA-Seq, Tag Profiling, and Small RNA, respectively) using real data examples (references provided in the poster). Each scenario follows a common workflow involving the preparation of DNA libraries from RNA samples, followed by secondary analysis, followed by tertiary analysis of the data in GeneSifter Analysis Edition.

For RNA-Seq, two datasets corresponding to mouse embryonic stem (ES) cells and embryoid body (EB) cells were investigated. DNA libraries were produced from each cell line. Sequences were collected from each library and compared to the RefSeq (NCBI) database according to the pipeline shown. The screen captures (middle of the panel) show how the individual reads map to each transcript along with the total numbers of hits summarized by chromosome. The process is run twice, once for each cell line, and the two sets of alignments are converted to Gene Lists for comparative analysis in GeneSifter Analysis Edition to observe differential expression (bottom of the panel).
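
A simple way to compare two gene lists is a log2 ratio of normalized expression values. The Python sketch below is only an illustration under stated assumptions: the gene names and values are made up, the pseudocount is a common convenience for genes seen in only one sample, and the statistics GeneSifter actually applies are not described in this post.

```python
import math


def log2_ratios(sample_a, sample_b, pseudocount=1.0):
    """log2(b / a) for every gene seen in either sample.

    The pseudocount avoids division by zero for genes missing from one list.
    """
    genes = sorted(set(sample_a) | set(sample_b))
    return {
        gene: math.log2(
            (sample_b.get(gene, 0.0) + pseudocount)
            / (sample_a.get(gene, 0.0) + pseudocount)
        )
        for gene in genes
    }


# Made-up normalized expression values for two samples.
sample_a = {"GeneA": 7.0, "GeneB": 3.0}
sample_b = {"GeneA": 1.0, "GeneB": 15.0, "GeneC": 3.0}

ratios = log2_ratios(sample_a, sample_b)
for gene, ratio in ratios.items():
    print(gene, ratio)
```

Positive values mean higher expression in the second sample and negative values mean lower; a real analysis would add replicates and a statistical test before calling a gene differentially expressed.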

The Tag Profiling panel examines data from a recently published experiment (a reference is provided in the poster) in which gene expression was studied in transgenic mice. I’ll leave out the details of the paper, and only point out how this example shows the differences between Tag Profiling and RNA-Seq data. Because Tag Profiling collects data from specific 3’ sites in RNA, the aligned data (middle of the panel) show alignments as single “spikes” toward the 3’ end of transcripts. Occasionally multiple peaks are observed, which raises a question: are the additional peaks the result of isoforms (alternative polyA sites) or incomplete restriction enzyme digests? How might this be sorted out? As with RNA-Seq, the bottom panel shows the comparative analysis of replicate samples from the wild type (WT) and transgenic (TG) mice.

Data from a small RNA analysis experiment are analyzed in the third panel. Unlike RNA-Seq and Tag Profiling, this secondary analysis has more comparisons of the reads to different sets of reference sequences. The purpose is to identify and filter out common artifacts observed in small RNA preparations. The pipeline we used, and data produced, are shown in the middle of the panel. Histogram plots of read length distribution, determined from alignments to different reference sources, are created because an important feature of small RNAs is that they are small. Distributions clustered around 22 nt indicate a good library. Finally, data are linked to additional reports and databases, like miRBase (Sanger Center), to explore results further. In the example shown, the first hit was to a small RNA that has been observed in opossums; now we have a human counterpart. In total, four samples were studied. Like RNA-Seq and Tag Profiling, we can observe the relative expression of each small RNA by analyzing the datasets together (hierarchical clustering, bottom).
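
The read length check can be sketched as follows in Python. The reads are placeholder strings rather than real sequences, and the helper names are hypothetical; the sketch only shows the idea of tallying lengths and looking for the most common one.

```python
from collections import Counter


def length_histogram(reads):
    """Tally how many reads have each length."""
    return Counter(len(read) for read in reads)


def modal_length(histogram):
    """Most common read length; ties go to the shorter length."""
    return max(histogram, key=lambda length: (histogram[length], -length))


# Placeholder reads; only their lengths matter for this check.
reads = ["A" * 22, "C" * 22, "G" * 21, "T" * 23, "A" * 22, "C" * 30]

histogram = length_histogram(reads)
for length in sorted(histogram):
    print(f"{length:3d} nt {'#' * histogram[length]}")
print("modal length:", modal_length(histogram))
```

A peak well away from the expected size range would be a hint that the library is dominated by degradation products or other artifacts rather than small RNAs.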

Section 6 presents some of the scale challenges that accompany NGS and how we are addressing them with HDF5 technology. This will be a topic of many more posts in the future.

We close the poster by addressing the challenges listed above with the final points:
  • High performance data management systems are being developed through the BioHDF project and GeneSifter system architectures.
  • The examples show how each application and sequencing platform requires a different data analysis workflow (pipeline). GeneSifter provides a platform to develop and make bioinformatics pipelines and data readily available to communities of biologists.
  • The transcriptome is complex; different libraries of sequence data can be used to filter known sequences (e.g. rRNA) and discover new elements (miRNAs) and isoforms of expressed genes.
  • Within a dataset, read maps, tables, and histogram plots are needed to summarize and understand the kinds of sequences present and how they relate to an experiment.
  • Links to Entrez Gene, the UCSC genome browser, and miRBase show how additional information can be integrated into the application framework and used.
  • Next Gen transcriptomics assays are similar to microarray assays in many ways, hence software systems like Geospiza’s GeneSifter are useful for comparative analysis.

You can also get the file, AGBT_2009.pdf

Thursday, February 19, 2009

Three Themes from AGBT and ABRF Part II: The Bioinformatics Bottleneck

In my last post, I summarized the presentations and conversations regarding Next Gen Sequencing (NGS) challenges in terms of three themes: You have to pay attention to details in the laboratory, bioinformatics is a bottleneck, and the IT burden is significant. In that post, I discussed the issues related to the laboratory and how GeneSifter Lab Edition overcomes those challenges.

This post tackles the second theme: the bioinformatics bottleneck.

In the Sanger days, bioinformatics was really a challenge for only the highest throughput facilities like genome centers. In these labs, streamlined workflows (pipelines) were developed for the different kinds of sequencing (genomes, ESTs [expressed sequence tags, 1], SAGE [serial analysis of gene expression, 2]). Because Sanger sequencing was high cost and low throughput compared to NGS, the cost of developing the bioinformatics pipelines was low relative to the cost of collecting the data. Thus, large-scale projects such as whole genome shotgun sequencing, ESTs, or resequencing studies could be supported by a handful of pipelines that changed infrequently. In addition, small-scale Sanger projects could be handled well by desktop software and services like NCBI BLAST.

NGS breaks the Sanger paradigm. A single NGS instrument has the same throughput as an entire warehouse of Sanger instruments. To illustrate this point in a striking way, we can look at dbEST - NCBI’s database of ESTs. Today, there are approximately 59 million reads in this database, representing the total accumulation of sequencing projects over a 10 year period. Considering that one run of an Illumina GA or SOLiD can produce between 80 and 180 million reads in a week or two, a single run can now produce up to three times more ESTs than we have seen deposited over the past 10 years. These numbers also dwarf the amount of data collected from other gene expression analysis systems like microarrays and sequencing techniques like SAGE.
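
The comparison is simple arithmetic on the approximate figures quoted above, shown here as a short Python calculation. The variable names are ours; the numbers are the ones in the text.

```python
# Approximate figures quoted in the text.
DBEST_READS = 59_000_000       # reads in dbEST, accumulated over about 10 years
RUN_READS_LOW = 80_000_000     # low end of one Illumina GA or SOLiD run
RUN_READS_HIGH = 180_000_000   # high end of one run

low_ratio = RUN_READS_LOW / DBEST_READS    # roughly 1.36
high_ratio = RUN_READS_HIGH / DBEST_READS  # roughly 3.05

print(f"One run yields {low_ratio:.2f} to {high_ratio:.2f} times the size of dbEST")
```

Even the low end of a single run exceeds the whole of dbEST, and the high end is where the "three times" figure comes from.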


The emergence of the bioinformatics bottleneck

The bioinformatics bottleneck is related to the fact that NGS platforms are general purpose; they collect sequence data. That’s it. Because they collect a lot of data very quickly, we can use sequences as data points for many different kinds of measurements. When we think this way, an extremely wide range of experiments can be conceived.

From sequencing complete genomes, to sampling genes in the environment, to measuring mutations in cancer, to understanding epigenomics, to measuring gene expression and the transcriptome, NGS applications are proliferating at a rapid pace. However, each experiment requires a specialized bioinformatics pipeline, and the algorithms used within a bioinformatics pipeline must be tuned for the data produced by the different sequencing platforms and the questions being asked. When these considerations are combined with other issues, like what reference data to use for sequence comparisons, the number of bioinformatics pipelines can grow in a combinatorial fashion.
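
To see how quickly the combinations add up, consider the small Python enumeration below. The three lists are illustrative choices, not a catalog: the platforms and applications are ones named in this post, while the reference data sets are examples we picked for the sketch, and not every combination would make sense in practice.

```python
from itertools import product

# Illustrative lists only; real labs will have their own.
platforms = ["Illumina GA", "SOLiD"]
applications = [
    "genome sequencing",
    "environmental sampling",
    "cancer mutation detection",
    "epigenomics",
    "gene expression",
]
references = ["genome build", "RefSeq", "miRBase"]

# Each (platform, application, reference) triple may need its own tuned pipeline.
pipelines = list(product(platforms, applications, references))
print(len(pipelines), "potential pipeline configurations")

# Adding one more platform multiplies the total rather than adding to it.
with_third_platform = list(product(platforms + ["Roche FLX"], applications, references))
print(len(with_third_platform), "after adding a third platform")
```

The count is the product of the list sizes, so each new platform, assay, or reference data set multiplies the number of configurations someone has to build and maintain.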

The early recommendation is that each lab wanting to do NGS work needs to have a dedicated bioinformatics professional. In more than one talk, presenters even quantified bioinformatics support in terms of FTEs (full time equivalents) per instrument. Bioinformatics is needed in both the sequencing laboratory, to develop and maintain quality control pipelines, and in the research environment, to process (align) the data, mine the output for interesting features, and perform comparative analyses between datasets.

But this won’t work

It is clear that bioinformatics is critical to understanding the data being produced. However, the current recommendation that any group planning NGS experiments should also have a dedicated bioinformatician is impractical for several reasons.

First, the model of a bioinformatician for every lab is simply not scalable. Fundamentally, there are not enough people who understand the science, programming, statistics, and other resources such as different forms of reference data, algorithms, and data types needed to make sense of NGS data. We see plenty of evidence, in the literature and presentations, that there are many outstanding people doing this work and contributing to the community. The problem is that they already have jobs!

Even if we consider that the above model is workable, hiring people takes significant time, is expensive, and ongoing costs are going to be high. These time and cost investments only become reasonable when a significant number of experiments are planned. One or two instruments will produce between 25 and 50 runs worth of data per year. If you calculate instrument costs, reagents, salary, and overhead costs, you are quickly into many thousands of dollars per sample. Indeed, a theme expressed in the bioinformatics bottleneck is that bioinformatics is becoming the single largest ongoing cost of NGS. Add in the IT computer support (next post) and you better have a plan for running a lot more than 50 runs per year. Remember the first issue - good bioinformaticians with NGS analysis experience have jobs.

If you have access to bioinformatics support, or can hire an individual, that person will quickly become overwhelmed with work. The biggest reason is that the software infrastructures needed to quickly develop new pipelines, automate them, and deliver data in ways that can be consumed by non-programming scientists are typically lacking. The result is that scientific programming efforts generally turn into lengthy software development projects because, without an infrastructure, the numbers and kinds of experiments quickly grow beyond the capacity of a single individual.

So, What can be done?

Geospiza solves the bioinformatics challenge in multiple ways. GeneSifter Lab and Analysis editions provide a platform that delivers the complete infrastructure needed to deploy NGS data processing pipelines and deliver results through web-based interfaces. These systems include pipelines for many of the common NGS applications such as transcription analysis, small RNA detection, ChIP-Seq and other assays. The system architecture and accompanying API create a framework to quickly add new pipelines and make the results available to biologists running the experiments.

For those with access to bioinformatics help, GeneSifter will make your team more productive because developers will be freed of the burden of having to create the automation and delivery infrastructure, enabling them to focus on new scientific programming problems. For those without access to such resources, we have many pipelines ready to go. Moreover, because we have a platform and the infrastructure already built, as well as deep bioinformatics experience, we can create and deliver new analysis pipelines quickly. Finally, our product development roadmap is well-aligned with the most common NGS assays, which means you can probably do your bioinformatics analysis today!

References: 

1. Adams M.D., Soares M.B., Kerlavage A.R., Fields C., Venter J.C., 1993. Rapid cDNA sequencing (expressed sequence tags) from a directionally cloned human infant brain cDNA library. Nat Genet 4, 373-380.

2. Velculescu V.E., Zhang L., Vogelstein B., Kinzler K.W., 1995. Serial analysis of gene expression. Science 270, 484-487.


Sunday, February 15, 2009

Three Themes from ABRF and AGBT Part I: The Laboratory Challenge

It's been an exciting week on the road at the AGBT and ABRF conferences. From the many presentations and discussions it is clear that the current and future next generation DNA sequencing (NGS) technologies are changing the way we think about genomics and molecular biology. It is also clear that successfully using these technologies impacts research and core laboratories in three significant areas:
  1. The Laboratory: Running successful experiments requires careful attention to detail.
  2. Bioinformatics: Every presentation called out bioinformatics as a major bottleneck. The data are hard to work with and different NGS experiments require different specialized bioinformatics workflows (pipelines).
  3. Information Technology (IT): The bioinformatics bottleneck is exacerbated by IT issues involving data storage, computation, and data transfer bandwidth.

We kicked off ABRF by participating in the Next Gen DNA Sequencing workshop on Saturday (Feb. 7). It was extremely well attended, with presentations on experiences in setting up labs for Next Gen sequencing, preparing DNA libraries for sequencing, and dealing with IT and bioinformatics.

I had the opportunity to provide the “overview” talk. In that presentation, “From Reads to Datasets, Why Next Gen is not Sanger Sequencing,” I focused on the kinds of things you can do with NGS technology, its power, and the high level issues that groups are facing today when implementing these systems. I also introduced one of our research projects on developing scalable infrastructures using HDF5 for Next Gen bioinformatics and high-performing, dynamic, software interfaces. Three themes resurfaced again and again throughout the day: one must pay attention to laboratory details, bioinformatics is a bottleneck, and don't underestimate the impact of NGS systems on IT.

In this post, I'll discuss the laboratory details and visit the other themes in posts to come.

Laboratory Management

To better understand the impact of NGS on the lab, we can compare it to Sanger sequencing. In the table below, different categories ranging from the kinds of samples, to their preparation, to the data, are considered to show how NGS differs from Sanger sequencing. Sequencing samples, for example, are very different between Sanger and NGS. In Sanger sequencing, one typically works with clones or PCR amplicons. Each sample (clone or PCR product) produces a single sequence read. Overall, sequencing systems are robust, so the biggest challenge for labs has been tracking the samples as they move from tube to plate or between wells within plates.

In contrast, NGS experiments involve sequencing DNA libraries, and each sample produces millions of reads. Presently, only a few samples are sequenced at a time, so the sample tracking issues, when compared to Sanger, are greatly reduced. Indeed, one of the significant advantages and cost savings of NGS is that it eliminates the need for cloning or PCR amplification in preparing templates to sequence.

Directly sequencing DNA libraries is a key ability and a major factor that makes NGS so powerful. It also directly contributes to the bioinformatics complexity (more on that in the next post). Each one of the millions of reads that are produced from the sample corresponds to an individual molecule, present in the DNA library. Thus, the overall quality of the data and the things you can learn are a direct function of the library.



Producing good libraries requires that you have a good handle on many factors. To begin, you will need to track RNA and DNA concentrations at different steps of the process. You also need to know the “quality” of the molecules in the sample. For example, RNA assays will give the best results when RNA is carefully prepared and free of RNases. In RNA-Seq, the best results are obtained when the RNA is fragmented prior to cDNA synthesis. To understand the quality of the starting RNA, fragmentation, and cDNA synthesis steps, tools like agarose gels or Bioanalyzer traces are used to evaluate fragment lengths and determine overall sample quality. Other assays and sequencing projects have similar processes. Throughout both conferences, it was stressed that regardless of whether you are sequencing genomes or small RNAs, performing RNA-Seq, or running other “tag and count” kinds of experiments, you need to pay attention to the details of the process. Tools like the NanoDrop or qPCR procedures need to be routinely used to measure RNA or DNA concentration. Tools like gels and the Bioanalyzer are used to measure sample quality. And, in many cases, both kinds of tools are used.

Through many conversations, it became clear that Bioanalyzer images, NanoDrop reports, and other lab data quickly accumulate during these kinds of experiments. While an NGS experiment is in progress, these data are pretty accessible and the links between data quality and the collected data are easy to see. It only takes a few weeks, however, for these lab data to disperse. They find their way into paper notebooks, or unorganized folders on multiple computers. When the results from one sample need to be compared to another, a new problem appears: it becomes harder and harder to find the lab data that correspond to each sample.

To summarize, NGS technology makes it possible to interrogate large ensembles of individual RNA or DNA molecules. Different questions can be asked by preparing the ensembles (libraries) in different ways involving complex procedures. To ensure that the resulting data are useful, the libraries need to be of high and known quality. Quality is measured with multiple tools at different points of the process to produce multiple forms of laboratory data. Traditional methods such as laboratory notebooks, files on computers, and post-it notes, however, make these data hard to find when the time comes to compare results between samples.

Fortunately, the GeneSifter Lab Edition solves these challenges. The Lab Edition of Geospiza’s software platform provides a comprehensive laboratory information management system (LIMS) for NGS and other kinds of genetic analysis assays, experiments, and projects. Using web-based interfaces, laboratories can define protocols (laboratory workflows) with any number of steps. Steps may be ordered and required to ensure that procedures are correctly followed. Within each step, the laboratory can define and collect different kinds of custom data (NanoDrop values, Bioanalyzer traces, gel images, ...). Laboratories using the GeneSifter Lab Edition can produce more reliable information because they can track the details of their library preparation and link key laboratory data to sequencing results.
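
To make the idea of ordered, required steps with custom data a little more concrete, here is a minimal Python sketch of such a workflow model. It is only an illustration and not how GeneSifter Lab Edition is implemented: the class names (Step, SampleRecord, Protocol), the field names, the step names, and the example values are all hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class Step:
    """One step in a laboratory workflow (hypothetical model)."""
    name: str
    required: bool = True
    fields: tuple = ()  # names of the custom data collected at this step


@dataclass
class SampleRecord:
    """Lab data collected for one sample, keyed by step name."""
    sample_id: str
    completed: dict = field(default_factory=dict)


class Protocol:
    """An ordered list of steps that samples move through."""

    def __init__(self, name, steps):
        self.name = name
        self.steps = list(steps)

    def record(self, sample, step_name, **data):
        names = [s.name for s in self.steps]
        if step_name not in names:
            raise ValueError(f"unknown step: {step_name}")
        index = names.index(step_name)
        # Enforce ordering: every required earlier step must be done first.
        for earlier in self.steps[:index]:
            if earlier.required and earlier.name not in sample.completed:
                raise ValueError(f"required step not completed: {earlier.name}")
        # Make sure the data defined for this step were actually supplied.
        missing = [f for f in self.steps[index].fields if f not in data]
        if missing:
            raise ValueError(f"missing data for {step_name}: {missing}")
        sample.completed[step_name] = dict(data)


# Illustrative protocol; step names, field names, and values are made up.
rna_seq = Protocol("RNA-Seq library prep", [
    Step("RNA isolation", fields=("concentration_ng_per_ul",)),
    Step("Fragmentation", fields=("bioanalyzer_trace",)),
    Step("cDNA synthesis", fields=("concentration_ng_per_ul",)),
])
sample = SampleRecord("sample-001")
rna_seq.record(sample, "RNA isolation", concentration_ng_per_ul=250.0)
rna_seq.record(sample, "Fragmentation", bioanalyzer_trace="trace-001.pdf")
```

Because every measurement is stored against the sample and the step that produced it, the lab data stay attached to the sample when results need to be compared later, instead of dispersing into notebooks and folders.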
