Wednesday, February 3, 2010

Sneak Peek: Data Analysis Methods for Whole Transcriptome Sequencing Applications – Challenges and Solutions

RNA sequencing is one of the most popular Next Generation Sequencing (NGS) applications. Next Thursday, February 11 at 10:00 A.M. PST (1:00 P.M. EST), we kick off our 2010 webinar series with a presentation designed to help you understand whole transcriptome data analysis and what can be learned in these experiments. In addition, we will show off some of our latest tools and interfaces that can be used to discover new RNAs, new splice forms of transcripts, and alleles of expressed genes.

Summary

RNA sequencing applications such as Whole Transcriptome Analysis, Tag Profiling and Small RNA Analysis allow whole genome analysis of coding as well as non-coding RNA at an unprecedented level. Current technologies allow for the generation of 500 million data points in a single instrument run. In addition to allowing for the complete characterization of all known RNAs in a sample (gene level expression summaries, exon usage, splice junctions, single nucleotide variants, insertions and deletions), these applications are also ideal for the identification of novel RNAs as well as novel splicing events.

This presentation will provide an overview of Whole Transcriptome data analysis workflows with emphasis on calculating gene and exon level expression values as well as identifying splice junctions and variants from short read data. Comparisons of multiple groups to identify differential gene expression as well as differential splicing will also be discussed. Using data drawn from the GEO data repository and Short Read Archive (SRA), analysis examples will be presented for both Illumina’s GA and Lifetech’s SOLiD instruments.
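As a rough illustration of what a gene level expression value is, the sketch below computes RPKM (reads per kilobase of feature per million mapped reads), a normalization commonly used for RNA-Seq counts. The webinar may well cover other measures; the gene names, lengths, read counts, and the `rpkm` helper are all made up for this example.

```python
def rpkm(read_count, feature_length_bp, total_mapped_reads):
    """Reads per kilobase of feature per million mapped reads."""
    if feature_length_bp <= 0 or total_mapped_reads <= 0:
        raise ValueError("length and total mapped reads must be positive")
    return read_count * 1e9 / (feature_length_bp * total_mapped_reads)


# Hypothetical data: gene -> (reads aligned to the gene, gene length in bases)
counts = {
    "geneA": (500, 2000),
    "geneB": (500, 1000),
    "geneC": (0, 1500),
}
total_mapped = 10_000_000  # hypothetical number of mapped reads in the sample

# Normalize each gene's read count by its length and the sequencing depth
expression = {
    gene: rpkm(reads, length, total_mapped)
    for gene, (reads, length) in counts.items()
}
```

In this toy data geneA and geneB collect the same number of reads, but geneB is half as long, so its normalized expression value comes out twice as high. The same calculation applies to exons if exon lengths and exon read counts are used instead.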

Register Today!


Sunday, January 31, 2010

Next Generation Sequencing: Deep and Fast!

The Next Generation Sequencing field is fast moving. A simple NGS search in PubMed already yields 51 articles for 2010. At this rate there will be over 600 papers this year that use NGS, review NGS progress, or introduce new NGS analysis methods.
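The estimate is a simple linear extrapolation. It assumes the January publication rate holds for the rest of the year, which it may not:

```python
articles_so_far = 51   # PubMed hits dated 2010, as of January 31
months_elapsed = 1     # assumption: the count covers roughly one month

# Scale the count to a full 12-month year
projected_for_year = articles_so_far * 12 // months_elapsed
```

That works out to 612 articles, hence "over 600."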

The references retrieved from the search are listed below.

1: Marguerat S, Bahler J. RNA-seq: from technology to biology. Cell Mol Life Sci. 2010 Feb;67(4):569-79. Epub 2009 Oct 27. Review. PubMed PMID: 19859660.

2: Mamanova L, Coffey AJ, Scott CE, Kozarewa I, Turner EH, Kumar A, Howard E, Shendure J, Turner DJ. Target-enrichment strategies for next-generation sequencing. Nat Methods. 2010 Feb;7(2):111-8. PubMed PMID: 20111037.

3: Jex AR, Hall RS, Littlewood DT, Gasser RB. An integrated pipeline for next-generation sequencing and annotation of mitochondrial genomes. Nucleic Acids Res. 2010 Feb;38(2):522-33. Epub 2009 Nov 5. PubMed PMID: 19892826; PubMed Central PMCID: PMC2811008.

4: Quinlan AR, Hall IM. BEDTools: A flexible suite of utilities for comparing genomic features. Bioinformatics. 2010 Jan 28. [Epub ahead of print] PubMed PMID: 20110278.

5: Gottlieb B, Beitel LK, Alvarado C, Trifiro MA. Selection and mutation in the "new" genetics: an emerging hypothesis. Hum Genet. 2010 Jan 23. [Epub ahead of print] PubMed PMID: 20099069.

6: Monger WA, Alicai T, Ndunguru J, Kinyua ZM, Potts M, Reeder RH, Miano DW, Adams IP, Boonham N, Glover RH, Smith J. The complete genome sequence of the Tanzanian strain of Cassava brown streak virus and comparison with the Ugandan strain sequence. Arch Virol. 2010 Jan 22. [Epub ahead of print] PubMed PMID: 20094895.

7: Popp C, Dean W, Feng S, Cokus SJ, Andrews S, Pellegrini M, Jacobsen SE, Reik W. Genome-wide erasure of DNA methylation in mouse primordial germ cells is affected by AID deficiency. Nature. 2010 Jan 22. [Epub ahead of print] PubMed PMID: 20098412.

8: Li R, Fan W, Tian G, Zhu H, He L, Cai J, Huang Q, Cai Q, Li B, Bai Y, Zhang Z, Zhang Y, Wang W, Li J, Wei F, Li H, Jian M, Li J, Zhang Z, Nielsen R, Li D, Gu W, Yang Z, Xuan Z, Ryder OA, Leung FC, Zhou Y, Cao J, Sun X, Fu Y, Fang X, Guo X, Wang B, Hou R, Shen F, Mu B, Ni P, Lin R, Qian W, Wang G, Yu C, Nie W, Wang J, Wu Z, Liang H, Min J, Wu Q, Cheng S, Ruan J, Wang M, Shi Z, Wen M, Liu B, Ren X, Zheng H, Dong D, Cook K, Shan G, Zhang H, Kosiol C, Xie X, Lu Z, Zheng H, Li Y, Steiner CC, Lam TT, Lin S, Zhang Q, Li G, Tian J, Gong T, Liu H, Zhang D, Fang L, Ye C, Zhang J, Hu W, Xu A, Ren Y, Zhang G, Bruford MW, Li Q, Ma L, Guo Y, An N, Hu Y, Zheng Y, Shi Y, Li Z, Liu Q, Chen Y, Zhao J, Qu N, Zhao S, Tian F, Wang X, Wang H, Xu L, Liu X, Vinar T, Wang Y, Lam TW, Yiu SM, Liu S, Zhang H, Li D, Huang Y, Wang X, Yang G, Jiang Z, Wang J, Qin N, Li L, Li J, Bolund L, Kristiansen K, Wong GK, Olson M, Zhang X, Li S, Yang H, Wang J, Wang J. The sequence and de novo assembly of the giant panda genome. Nature. 2010 Jan 21;463(7279):311-7. Epub 2009 Dec 13. PubMed PMID: 20010809.

9: Falgueras J, Lara AJ, Fernandez-Pozo N, Canton FR, Perez-Trabado G, Claros MG. SeqTrim: a high-throughput pipeline for preprocessing any type of sequence reads. BMC Bioinformatics. 2010 Jan 20;11(1):38. [Epub ahead of print] PubMed PMID: 20089148.

10: Beck AH, Weng Z, Witten DM, Zhu S, Foley JW, Lacroute P, Smith CL, Tibshirani R, van de Rijn M, Sidow A, West RB. 3'-end sequencing for expression quantification (3SEQ) from archival tumor samples. PLoS One. 2010 Jan 19;5(1):e8768. PubMed PMID: 20098735; PubMed Central PMCID: PMC2808244.

11: Webb KM, Rosenthal BM. Deep resequencing of Trichinella spiralis reveals previously un-described single nucleotide polymorphisms and intra-isolate variation within the mitochondrial genome. Infect Genet Evol. 2010 Jan 18. [Epub ahead of print] PubMed PMID: 20083232.

12: Galvez S, Diaz D, Hernandez P, Esteban FJ, Caballero JA, Dorado G. Next-Generation Bioinformatics: Using Many-Core Processor Architecture to Develop a Web Service for Sequence Alignment. Bioinformatics. 2010 Jan 16. [Epub ahead of print] PubMed PMID: 20081221.

13: Gilchrist E, Haughn G. Reverse genetics techniques: engineering loss and gain of gene function in plants. Brief Funct Genomic Proteomic. 2010 Jan 16. [Epub ahead of print] PubMed PMID: 20081218.

14: Lavoie PM, Dube MP. Genetics of bronchopulmonary dysplasia in the age of genomics. Curr Opin Pediatr. 2010 Jan 16. [Epub ahead of print] PubMed PMID: 20087186.

15: Hyten DL, Cannon SB, Song Q, Weeks N, Fickus EW, Shoemaker RC, Specht JE, Farmer AD, May GD, Cregan PB. High-throughput SNP discovery through deep resequencing of a reduced representation library to anchor and orient scaffolds in the soybean whole genome sequence. BMC Genomics. 2010 Jan 15;11(1):38. [Epub ahead of print] PubMed PMID: 20078886.

16: Pleasance ED, Stephens PJ, O'Meara S, McBride DJ, Meynert A, Jones D, Lin ML, Beare D, Lau KW, Greenman C, Varela I, Nik-Zainal S, Davies HR, Ordonez GR, Mudie LJ, Latimer C, Edkins S, Stebbings L, Chen L, Jia M, Leroy C, Marshall J, Menzies A, Butler A, Teague JW, Mangion J, Sun YA, McLaughlin SF, Peckham HE, Tsung EF, Costa GL, Lee CC, Minna JD, Gazdar A, Birney E, Rhodes MD, McKernan KJ, Stratton MR, Futreal PA, Campbell PJ. A small-cell lung cancer genome with complex signatures of tobacco exposure. Nature. 2010 Jan 14;463(7278):184-90. Epub 2009 Dec 16. PubMed PMID: 20016488.

17: Myles S, Chia JM, Hurwitz B, Simon C, Zhong GY, Buckler E, Ware D. Rapid genomic characterization of the genus vitis. PLoS One. 2010 Jan 13;5(1):e8219. PubMed PMID: 20084295; PubMed Central PMCID: PMC2805708.

18: Santuari L, Pradervand S, Amiguet-Vercher AM, Thomas J, Dorcey E, Harshman K, Xenarios I, Juenger TE, Hardtke CS. Substantial deletion overlap among divergent Arabidopsis genomes revealed by intersection of short reads and tiling arrays. Genome Biol. 2010 Jan 12;11(1):R4. [Epub ahead of print] PubMed PMID: 20067627.

19: Pool J, Hellmann I, Jensen J, Nielsen R. Population genetics of genome-scale sequence variation. Genome Res. 2010 Jan 12. [Epub ahead of print] PubMed PMID: 20067940.

20: Duncan EL, Brown MA. Mapping genes for osteoporosis-Old dogs and new tricks. Bone. 2010 Jan 11. [Epub ahead of print] PubMed PMID: 20060943.

21: Byrne SL, Durandeau K, Nagy I, Barth S. Identification of ABC transporters from Lolium perenne L. that are regulated by toxic levels of selenium. Planta. 2010 Jan 9. [Epub ahead of print] PubMed PMID: 20063009.

22: Zhao CZ, Xia H, Frazier TP, Yao YY, Bi YP, Li AQ, Li MJ, Li CS, Zhang BH, Wang XJ. Deep sequencing identifies novel and conserved microRNAs in peanut (Arachis hypogaea L.). BMC Plant Biol. 2010 Jan 5;10(1):3. [Epub ahead of print] PubMed PMID: 20047695.

23: Medina M, Sachs JL. Symbiont genomics, our new tangled bank. Genomics. 2010 Jan 4. [Epub ahead of print] PubMed PMID: 20053372.

24: Hittinger CT, Johnston M, Tossberg JT, Rokas A. Leveraging skewed transcript abundance by RNA-Seq to increase the genomic depth of the tree of life. Proc Natl Acad Sci U S A. 2010 Jan 4. [Epub ahead of print] PubMed PMID: 20080632.

25: Volpi L, Roversi G, Colombo EA, Leijsten N, Concolino D, Calabria A, Mencarelli MA, Fimiani M, Macciardi F, Pfundt R, Schoenmakers EF, Larizza L. Targeted next-generation sequencing appoints c16orf57 as clericuzio-type poikiloderma with neutropenia gene. Am J Hum Genet. 2010 Jan;86(1):72-6. Epub 2009 Dec 10. PubMed PMID: 20004881.

26: Stankiewicz P, Lupski JR. Structural variation in the human genome and its role in disease. Annu Rev Med. 2010;61:437-55. PubMed PMID: 20059347.

27: Clement NL, Snell Q, Clement MJ, Hollenhorst PC, Purwar J, Graves BJ, Cairns BR, Johnson WE. The GNUMAP algorithm: unbiased probabilistic mapping of oligonucleotides from next-generation sequencing. Bioinformatics. 2010 Jan 1;26(1):38-45. Epub 2009 Oct 27. PubMed PMID: 19861355.

28: Arner E, Hayashizaki Y, Daub CO. NGSView: an extensible open source editor for next-generation sequencing data. Bioinformatics. 2010 Jan 1;26(1):125-6. Epub 2009 Oct 24. PubMed PMID: 19855106; PubMed Central PMCID: PMC2796816.

29: Jex AR, Littlewood DT, Gasser RB. Toward next-generation sequencing of mitochondrial genomes--focus on parasitic worms of animals and biotechnological implications. Biotechnol Adv. 2010 Jan-Feb;28(1):151-9. Epub . Review. PubMed PMID: 19913084.

30: Jex AR, Gasser RB. Genetic richness and diversity in Cryptosporidium hominis and C. parvum reveals major knowledge gaps and a need for the application of "next generation" technologies--research review. Biotechnol Adv. 2010 Jan-Feb;28(1):17-26. Epub . Review. PubMed PMID: 19699288.

31: Chou LS, Liu CS, Boese B, Zhang X, Mao R. DNA sequence capture and enrichment by microarray followed by next-generation sequencing for targeted resequencing: neurofibromatosis type 1 gene as a model. Clin Chem. 2010 Jan;56(1):62-72. Epub 2009 Nov 12. PubMed PMID: 19910506.

32: Nagalakshmi U, Waern K, Snyder M. RNA-Seq: a method for comprehensive transcriptome analysis. Curr Protoc Mol Biol. 2010 Jan;Chapter 4:Unit 4.11.1-13. PubMed PMID: 20069539.

33: Fullwood MJ, Han Y, Wei CL, Ruan X, Ruan Y. Chromatin interaction analysis using paired-end tag sequencing. Curr Protoc Mol Biol. 2010 Jan;Chapter 21:Unit 21.15.1-25. PubMed PMID: 20069536.

34: Roukos DH. Novel clinico-genome network modeling for revolutionizing genotype-phenotype-based personalized cancer care. Expert Rev Mol Diagn. 2010 Jan;10(1):33-48. PubMed PMID: 20014921.

35: Liu S, Chen HD, Makarevitch I, Shirmer R, Emrich SJ, Dietrich CR, Barbazuk WB, Springer NM, Schnable PS. High-throughput genetic mapping of mutants via quantitative single nucleotide polymorphism typing. Genetics. 2010 Jan;184(1):19-26. Epub 2009 Nov 2. PubMed PMID: 19884313.

36: Day IN. dbSNP in the detail and copy number complexities. Hum Mutat. 2010 Jan;31(1):2-4. PubMed PMID: 20024941.

37: Hamady M, Lozupone C, Knight R. Fast UniFrac: facilitating high-throughput phylogenetic analyses of microbial communities including analysis of pyrosequencing and PhyloChip data. ISME J. 2010 Jan;4(1):17-27. Epub 2009 Aug 27. PubMed PMID: 19710709; PubMed Central PMCID: PMC2797552.

38: Aparicio SA, Huntsman DG. Does massively parallel DNA resequencing signify the end of histopathology as we know it? J Pathol. 2010 Jan;220(2):307-15. PubMed PMID: 19921711.

39: Bell DW. Our changing view of the genomic landscape of cancer. J Pathol. 2010 Jan;220(2):231-43. PubMed PMID: 19918804.

40: Nobuta K, McCormick K, Nakano M, Meyers BC. Bioinformatics analysis of small RNAs in plants using next generation sequencing technologies. Methods Mol Biol. 2010;592:89-106. PubMed PMID: 19802591.

41: Salmon A, Ainouche ML. Polyploidy and DNA methylation: new tools available. Mol Ecol. 2010 Jan;19(2):213-5. PubMed PMID: 20078770.

42: Gathering clouds and a sequencing storm: why cloud computing could broaden community access to next-generation sequencing. Nat Biotechnol. 2010 Jan;28(1):1. PubMed PMID: 20062015.

43: Metzker ML. Sequencing technologies - the next generation. Nat Rev Genet. 2010 Jan;11(1):31-46. Epub 2009 Dec 8. Review. PubMed PMID: 19997069.

44: Northcott PA, Rutka JT, Taylor MD. Genomics of medulloblastoma: from Giemsa-banding to next-generation sequencing in 20 years. Neurosurg Focus. 2010 Jan;28(1):E6. PubMed PMID: 20043721.

45: Yang JH, Shao P, Zhou H, Chen YQ, Qu LH. deepBase: a database for deeply annotating and mining deep sequencing data. Nucleic Acids Res. 2010 Jan;38(Database issue):D123-30. Epub 2009 Dec 4. PubMed PMID: 19966272; PubMed Central PMCID: PMC2808990.

46: Shumway M, Cochrane G, Sugawara H. Archiving next generation sequencing data. Nucleic Acids Res. 2010 Jan;38(Database issue):D870-1. Epub 2009 Dec 3. PubMed PMID: 19965774; PubMed Central PMCID: PMC2808927.

47: Brooksbank C, Cameron G, Thornton J. The European Bioinformatics Institute's data resources. Nucleic Acids Res. 2010 Jan;38(Database issue):D17-25. Epub 2009 Nov 24. PubMed PMID: 19934258; PubMed Central PMCID: PMC2808956.

48: Kim P, Yoon S, Kim N, Lee S, Ko M, Lee H, Kang H, Kim J, Lee S. ChimerDB 2.0--a knowledgebase for fusion genes updated. Nucleic Acids Res. 2010 Jan;38(Database issue):D81-5. Epub 2009 Nov 11. PubMed PMID: 19906715; PubMed Central PMCID: PMC2808913.

49: Leinonen R, Akhtar R, Birney E, Bonfield J, Bower L, Corbett M, Cheng Y, Demiralp F, Faruque N, Goodgame N, Gibson R, Hoad G, Hunter C, Jang M, Leonard S, Lin Q, Lopez R, Maguire M, McWilliam H, Plaister S, Radhakrishnan R, Sobhany S, Slater G, Ten Hoopen P, Valentin F, Vaughan R, Zalunin V, Zerbino D, Cochrane G. Improvements to services at the European Nucleotide Archive. Nucleic Acids Res. 2010 Jan;38(Database issue):D39-45. Epub 2009 Nov 11. PubMed PMID: 19906712; PubMed Central PMCID: PMC2808951.

50: Kaminuma E, Mashima J, Kodama Y, Gojobori T, Ogasawara O, Okubo K, Takagi T, Nakamura Y. DDBJ launches a new archive database with analytical tools for next-generation sequence data. Nucleic Acids Res. 2010 Jan;38(Database issue):D33-8. Epub 2009 Oct 22. PubMed PMID: 19850725; PubMed Central PMCID: PMC2808917.

51: Jing H. [Advances in approaches for the quantitative detection of microRNAs.]. Yi Chuan. 2010 Jan;32(1):31-40. Chinese. PubMed PMID: 20085883.

Monday, January 25, 2010

Grant Opportunities for Next Generation DNA Sequencing

As we close the first month of 2010, it is time to get your pencils sharpened and submit proposals for new shared instruments.

The National Center for Research Resources (NCRR) announced that it has $43M to fund equipment purchases in 2011. With this money, NCRR expects to make approximately 125 new awards for instruments that cost at least $100,000 but less than $600,000. NCRR proposals are due March 23, 2010.

In addition to NCRR, the National Science Foundation (NSF), through its Major Research Instrumentation (MRI) program, has $90M to make 150 awards of between $100,000 and $4M for shared instrumentation. MRI proposals are due April 21, 2010.

Remember, when preparing proposals, a sound informatics plan will make your application stand out. Contact us if you’d like more information.


Monday, January 18, 2010

Systems Biology with HDF5

As many are aware, Geospiza and The HDF Group are collaborating to extend HDF (Hierarchical Data Format) technologies to support the data management needs of high performance computing applications in genomics. As we do this work, others are also adopting HDF5 as a data storage technology to work with different kinds of biological data.

The Association for Computing Machinery (ACM) recently published an article, "Unifying Biological Image Formats with HDF5," that argues for using HDF5 and HDF tools as a common framework for working with image files. This article is worth reading for several reasons.

First, it provides a nice introduction and background to HDF5, its origins, and movement towards becoming an ISO standard. HDF5's technical features are also included in this discussion.

Next, a brief history of the imaging community is covered to share how X-ray crystallographers and electron and optical microscopists had all independently considered HDF5 as a framework for their next-generation image file formats. Through this discussion, the challenges that have been identified within the imaging community are listed.

As in genomics, the amounts of data being collected are ever increasing, current formats are inflexible and difficult to adapt to future modalities and dimensionality, and the nonarchival quality of data undermines long-term value. That is, current data typically lack sufficient metadata about their origins and experiments to be useful in the long term.

The article goes on to make the point that current challenges with image data could be addressed if the community adopts an existing format that can support both generic and specialized data formats and meet a set of common requirements related to performance, interoperability, and archiving. Examples of how HDF5 meets these requirements are included. Briefly, HDF5's data caching can be used to overcome computation bottlenecks related to the fact that image sizes are exceeding RAM capacity. Interoperability issues can be addressed through HDF5's ability to store multiple metadata schemas in flexible ways. And, because HDF5 is self-describing, data stored in HDF5 can be better preserved.

Finally, a barrier to moving to a new technology is supporting legacy applications that may be costly to replace. Thus, the article closes with a creative proposal for supporting legacy software applications and recommendations for future development. HDF5 files could support legacy software applications if they were able to present the data, stored within the HDF5 file, as the collection of directories and files required by the legacy application. This could be accomplished by developing an abstraction layer that could interact with FUSE (Filesystem in Userspace) and essentially mount the HDF5 file as a virtual file system. Such a scenario is only possible because data are stored in HDF5 in a general way that can be further abstracted and presented in multiple specific ways.
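The idea of presenting hierarchically stored data as directories and files can be sketched without HDF5 or FUSE at all. In the sketch below, a nested dictionary stands in for an HDF5 file's groups and datasets, and a hypothetical `VirtualView` class answers the two questions a file system layer would ask: what is in this directory, and what are the bytes of this file. The class, its method names, and the sample layout are assumptions for illustration; they do not come from the article, the HDF5 library, or any FUSE binding.

```python
class VirtualView:
    """Present a nested dict (a stand-in for HDF5 groups/datasets) as paths."""

    def __init__(self, root):
        self.root = root

    def _resolve(self, path):
        # Walk the hierarchy one path component at a time
        node = self.root
        for part in path.strip("/").split("/"):
            if part == "":
                continue  # handles the root path "/"
            if not isinstance(node, dict) or part not in node:
                raise FileNotFoundError(path)
            node = node[part]
        return node

    def listdir(self, path="/"):
        # Groups (dicts) play the role of directories
        node = self._resolve(path)
        if not isinstance(node, dict):
            raise NotADirectoryError(path)
        return sorted(node)

    def read(self, path):
        # Leaf values (bytes) play the role of file contents
        node = self._resolve(path)
        if isinstance(node, dict):
            raise IsADirectoryError(path)
        return node


# Hypothetical layout a legacy imaging application might expect on disk
store = {
    "run_001": {
        "image_0001.tif": b"\x00\x01\x02",
        "image_0002.tif": b"\x03\x04\x05",
        "metadata.txt": b"instrument=example\n",
    }
}
view = VirtualView(store)
```

A real abstraction layer would map these same two operations onto FUSE callbacks and read from the HDF5 file rather than a dictionary, but the design point is the same: because the stored data are general, the view presented to the legacy application can be chosen separately.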

While this article focused on issues related to image formats, there are many parallels that the genomics and Next Generation Sequencing communities should pay attention to. If you are a bioinformatics software developer or are running bioinformatics projects, you should put this paper on your must-read list.


Wednesday, January 13, 2010

2010 sequencing starts in style

Next Generation Sequencing (NGS) is a hot topic. As we kick off 2010, many themes continue. Data throughput is increasing, sequencing costs are decreasing, and NGS still requires extensive informatics support.

Throughput up, costs down

As sequencing throughput increases, the costs for collecting sequencing data decrease. Illumina is setting the pace for 2010 by announcing its latest sequencing instrument, the HiSeq2000. Illumina’s press release, news reports, and the blogosphere enthusiastically report on the instrument’s fivefold increase in data throughput and ability to sequence an entire human genome in about one week for about $10,000.

What about the informatics?

This month’s reviews and editorials in Nature Reviews Genetics (NRG) and Nature Biotechnology (NBT), respectively, claim that the most significant NGS challenge continues to be dealing with the data. As pointed out in the NBT editorial, it is quite possible that the community will produce more sequence data this year than has been cumulatively produced in the past 10 years. The HiSeq, developments that will be announced by Applied Biosystems in February, and the coming single molecule sequencers support this. The editorial further makes the point that genome centers have the computing infrastructure to deal with the data, but the larger community of researchers, who could benefit from these technologies, do not. A similar observation was made at the end of the NRG review, which pointed out that costs associated with downstream handling and processing of the data will possibly equal or exceed data collection costs.

The significance of the informatics challenge is that wide adoption of NGS technologies assumes that we have usable solutions for working with the data. These solutions go beyond simply getting a computer cluster with a sequencing instrument. To be useful, that cluster needs to reside in an adequately air conditioned room, be operated by people who know how to work with cluster hardware and software and can also optimize networks to manage the flow of data. Other individuals are needed who can write programs and scripts to process the data, work with multiple database technologies, and develop scalable user interfaces to visualize and navigate through the results and compare information between multiple samples and experiments.

The conversation about the informatics problem began with the introduction of NGS technologies. In 2008, Nature Methods (July) and NBT (October) published editorials speaking to the coming challenges. Later in 2009, Science published a new article about data intensive science. Previous FinchTalks have discussed the articles and their significance and the theme has remained the same; both the access to computing technologies and the skills needed to use the data are unavailable to the large numbers of researchers who need to use these technologies to remain competitive.

There is a solution

One solution to the informatics challenge created by NGS, and other data intensive technologies, is to make use of the immense Internet-based computing infrastructure that has been created by companies like Amazon, Google, Yahoo, and others. Also called Cloud Computing, Internet-based services remove many of the hardware and infrastructure barriers for utilizing high performance computing and storage technology. This message was delivered by the 2010 NBT kickoff editorial and accompanying news feature, along with the next important message that software solutions also need to be adapted to cloud environments. Here the editorial, like many other descriptions of NGS informatics needs, falls short in that it focuses only on alignment programs. Simply adapting alignment algorithms using technologies like Hadoop to employ Cloud-based high performance computing clusters is not a sufficient solution.

Aligning billions of reads to reference data quickly and accurately is clearly important. However, it is just the first step of a complex analysis process. The subsequent steps of analyzing the billions of alignments to filter artifacts, identify true and new variation between sequences, discover alternative splice forms in transcripts, and compare data between samples are even more challenging.

Fortunately, Geospiza understands the problem well. As our tag line, From Samples to Results™, suggests, our lab and analysis systems focus on solving a complete set of problems that need to be addressed in order to do good science with NGS and other genetic analysis technologies.

Perhaps this is why we were the only software provider discussed in the NBT news feature, “Up in a cloud.”


Thursday, December 31, 2009

2009 Review

The end of the year is a good time to reflect, review accomplishments, and think about the year to come. 2009 was a good year for Geospiza’s customers, with many exciting accomplishments for the company. Highlights are reviewed below.

Two products form a complete genetic analysis system


Geospiza’s two core products, GeneSifter Laboratory Edition (GSLE) and GeneSifter Analysis Edition (GSAE), help laboratories do their work and scientists analyze their data. GSLE is the LIMS (Laboratory Information Management System) that laboratories, from service labs to high-throughput data production centers, use to collect information about samples, track and manage laboratory procedures, organize and process data, and deliver data and results back to researchers. GSLE supports traditional DNA sequencing (Sanger), fragment analysis, genotyping, microarrays, Next Generation Sequencing (NGS) and other technologies.

In 2008, Geospiza released the third version of the platform (back then it was known as FinchLab). This version launched a new way of providing LIMS solutions. Traditional LIMS systems require extensive programming and customization to meet a laboratory’s specific requirements. They include a very general framework designed to support a wide range of activities. Their advantage is that they are highly customizable. However, this advantage comes at the expense of very high acquisition costs accompanied by lengthy requirements planning and programming before they become operational.

In contrast, GSLE contains default settings that support genetic analysis out-of-the-box, while allowing laboratories to customize operations without programmer support. Default settings in GSLE support DNA sequencing, microarray, and genotyping services. The GSLE abstraction layer supports extensive configuration to meet specific needs as they arise. Through this design, the costs of acquiring and operating a high-quality advanced LIMS system are significantly reduced.

Throughout 2009, hundreds of features were added to GSLE to increase support for instruments and data types, and improve how laboratory procedures (workflows) are created, managed, and shared. Enhancements were made to features like experiment ordering, organization, and billing. We also added new application programming interfaces (APIs) to enable integration with enterprise software. Specific highlights included:
  • Extending microarray support to include sample sheet generation and automated file uploads
  • Improving NGS file and data browsing to simplify the process of searching and viewing the thousands of files produced in Next Gen sequencing runs
  • Making downloads of very large (gigabase-scale) NGS data files robust and easy
  • Adding worksets to group DNA and RNA samples in customized ways that facilitate laboratory processing
  • Creating APIs to utilize external password servers and programmatically receive data using GSLE form objects
  • Enhancing ways for groups to add HTML to pages to customize their look and feel
In addition to the above features, we’ve also completed development on methods to multiplex NGS samples and track MIDs (molecular identifiers and molecular barcodes), enter laboratory data like OD values and bead counts in batches, create orders with multiple plates, and access SQL queries through an API. Look for these great features and more in the early part of 2010.

GSAE

As noted, GSAE is Geospiza’s data analysis product. While GSLE is capable of running advanced data analysis pipelines, the primary focus of data analysis in GSLE is to provide quality control. Thus its data analyses and presentation focus on single samples. GSAE provides the infrastructure and tools to compare the results between samples. In the case of NGS, GSAE also provides more reports and data interactions. GSAE began as a web-based microarray data analysis platform, making it well suited for NGS-based gene expression assays. Over 2009 many new features were added to extend its utility to NGS data analysis with a focus on whole transcriptome analysis. Highlights included:
  • Developing data analysis pipelines for RNA-Seq, Small RNA, ChIP-Seq, and other kinds of NGS assays
  • Adding tools to visualize and discover alternatively spliced transcripts in gene expression assays
  • Extending expression analysis tools to include interactive volcano plots and unbalanced two-way ANOVA computations
  • Increasing NGS transcriptome analysis capabilities to include variation detection and visualization
The above features fulfill the requirements needed to make a platform complete for both NGS and microarray-based gene expression analysis. And, the addition of variation detection and visualization lays the groundwork for GSAE to extend its market leadership to resequencing data analysis.

Geospiza Research

In 2009 Geospiza won two research awards in the form of Phase II STTR and Phase I SBIR grants. The STTR project is researching new ways to organize, compress, and access NGS data by adapting HDF technologies to bioinformatics. Through this work we are developing a robust data management infrastructure that supports our NGS sequencing analysis pipelines and interactive user interfaces. The second award targets NGS-based variation detection. This work began in the last quarter of the year, but is already delivering new ways to identify and visualize variants in RNA-Seq and whole transcriptome analysis.

To learn more about our progress in 2009, visit our news page. It includes our press releases and reports in the news, publications citing our software, and webinars where we have presented our latest and greatest.

As we close 2009, we especially want to thank our customers and collaborators for their support in making the year successful and we look forward to an exciting year ahead in 2010.


Sunday, December 6, 2009

Expeditiously Exponential: Genome Standards in a New Era

One of the hot topics of 2009 has been the exponential growth in genomics and other data and how this growth will impact data use and sharing. The journal Science explored these issues in its policy forum in October. In early November, I discussed the first article, which was devoted to sharing data and data standards. The second article, listed under the category “Genomics,” focuses on how genomic standards need to evolve with new sequencing technologies.

Drafting By

The premise of the article “Genome Project Standards in a New Era of Sequencing” was to begin a conversation about how to define standards for sequence data quality in this new era of ultra-high throughput DNA sequencing. One of the “easy” things to do with Next Generation Sequencing (NGS) technologies is create draft genome sequences. A draft genomic sequence is defined as a collection of contig sequences that result from one, or a few, assemblies of large numbers of smaller DNA sequences called reads. In traditional Sanger sequencing a read was between 400 and 800 bases in length and came from a single clone, or sub-clone of a large DNA fragment. NGS reads come from individual molecules in a DNA library and vary between 36 and 800 bases in length depending on the sequencing platform being used (454, Illumina, SOLiD, or Helicos).

A single NGS run can now produce enough data to create a draft assembly for many kinds of organisms with smaller genomes such as viruses, bacteria, and fungi. This makes it possible to create many draft genomes quickly and inexpensively. Indeed the article was accompanied by a figure showing that the current growth of draft sequences exceeds the growth of finished sequences by a significant amount. If this trend continues, the ratio of draft to finished sequences will grow exponentially into the foreseeable future.

Drafty Standards

The primary purpose for a complete genome sequence is to serve as a reference for other kinds of experiments. A well annotated reference is accompanied by a catalog of genes and their functions, as well as an ordering of the genes, regulatory regions, and the sequences needed for evolutionary comparisons that further elucidate genomic structure and function. A problem with draft sequences is that they can contain a large number of errors that range from incorrect base calls to more problematic mis-assemblies that place bases or groups of bases in the wrong order. Because these holes leave some sequences draftier than others, those sequences are less useful in fulfilling their purpose as reference data.

If we can describe the draftiness of a genome sequence we may be able to weigh its fitness for a specific purpose. The article went on to tackle this problem by recommending a series of qualitative descriptions that describe levels of draft sequences. Beginning with the Standard Draft, or an assembly of contigs of unfiltered data from one or more sequencing platforms, the terms move through High-Quality Draft, to Improved High-Quality Draft, to Annotation-Directed Improvement, to Noncontiguous Finished, to Finished. Finished sequence is defined as less than 1 error per 100,000 bases and each genomic unit (chromosomes or plasmids that are capable of replication) is assembled into a single contig with a minimal number of exceptions. The individuals proposing these standards are a well-respected group in the genome community and are working with the database groups and sequence ontology groups to incorporate these new descriptions into data submissions and annotations for data that may be used by others.

Given the high cost and lengthy time required to finish genomic sequences, finishing every genome to a high standard is impractical. If we are going to work with genomes that are finished to varying degrees, systematic ways to describe the quality of the data are needed. These policy recommendations are a good start, but more needs to be done to make the proposed standards useful.

First, standards need to be quantitative. Qualitative descriptions are less useful because they create downstream challenges when reference data are used in automated data processing and interpretation pipelines. As the numbers of available genomes grow into the thousands and tens of thousands, subjective standards make the data more and more cumbersome and difficult to review. Moreover, without quantitative assessment, how will one know when they have an average error rate of 1 in 100,000 bases? The authors intentionally avoided recommending numeric thresholds in the proposed standards because the instrumentation and sequencing methodologies are changing rapidly. This may be true, but future discussions nevertheless should focus on quantitative descriptions for that very reason. It is because data collection methods and instrumentation are changing rapidly that we need measures we can compare. This is the new world.
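One way such a quantitative check might look, sketched here under the assumption that every base in the consensus carries a Phred quality score Q (where the implied error probability is 10^(-Q/10)): average the per-base error probabilities and compare the result to the 1-in-100,000 threshold. This is not a method proposed in the article, and the function names and example scores are hypothetical.

```python
def expected_error_rate(phred_scores):
    """Mean per-base error probability implied by Phred quality scores."""
    if not phred_scores:
        raise ValueError("no quality scores given")
    # A Phred score Q corresponds to an error probability of 10^(-Q/10)
    expected_errors = sum(10 ** (-q / 10) for q in phred_scores)
    return expected_errors / len(phred_scores)


FINISHED_THRESHOLD = 1 / 100_000  # less than 1 error per 100,000 bases


def meets_finished_error_rate(phred_scores):
    """True if the expected error rate is below the Finished threshold."""
    return expected_error_rate(phred_scores) < FINISHED_THRESHOLD


# Hypothetical consensus qualities for two 1,000-base contigs
high_quality = [60] * 1000               # Q60 everywhere: 1 error per million
mixed_quality = [60] * 990 + [20] * 10   # ten Q20 bases dominate the errors

high_ok = meets_finished_error_rate(high_quality)
mixed_ok = meets_finished_error_rate(mixed_quality)
```

The first contig passes and the second does not, even though 99% of its bases are identical in quality, which is the kind of distinction a qualitative label cannot make. A number like this can also be compared across instruments and over time, whatever the platform that produced the data.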

Second, the article fails to address how the different standards might be applied in a practical sense. For example, what can I expect to do with a finished genome that I cannot do with a nearly finished genome? What is a standard draft useful for? How should I trust my results and what might I expect to do to verify a finding? While the article does a good job describing the quality attributes of the data that genome centers might produce, the proposed standards would have broader impact if they could more specifically set expectations of what could be done with data.

Without this understanding, we still won't know when our data are good enough.
