Sunday, March 7, 2010

AGBT Round Up

This year's AGBT conference created a lot of excitement in the sequencing community.  It's been a week since the show, so everyone has had a chance to write up their blogs and news.

AGBT - Advances in Genome Biology and Technology

As the name implies, the AGBT conference focuses on genomics technologies and how they are applied to study biology.  Conference sessions cover the gamut of new genomics-based discoveries, new technologies, and informatics.  The predominant technology used in genomics research is DNA sequencing; hence a large portion of the conference is devoted to learning how next generation sequencing (NGS) instruments are improving and how new instruments will change the NGS landscape.  Because informatics is so important in NGS, the conference is attended by a lot of bioinformatics specialists who like to blog and communicate what they are learning in real time through Twitter.  Links to their posts are listed below.

Blogs and other summarized coverage

BioTechniques summary of single molecule sequencing (http://bit.ly/cjzth1).

Anthony Fejes' conference notes. Great read, lots of detail. (http://is.gd/9vmJX).

Genetic Inference summarizes instruments, talks, and speculates on single molecule sequencing (http://bit.ly/cWJyo7).

Genetic Future's coverage of the new sequencing instruments (http://bit.ly/d1UxZg).

MassGenomics' coverage of the cancer genomics session (http://bit.ly/cImXxZ).

The above sites also have other posts sharing the authors' perspectives on instruments and companies working in the NGS space.

Raw Data

For those interested in the blow-by-blow tweets as they occurred in real time, visit Twitter and search on #AGBT.


Tuesday, February 23, 2010

GeneSifter Lab Edition v3.14 - Release Notes

GeneSifter Laboratory Edition (GSLE) 3.14.0 introduces a host of new features and capabilities that make daily laboratory data management work even easier.  Read below to learn why GSLE is a leading LIMS product for all forms of DNA sequencing, microarrays, and other genetic analysis applications.

Orders and Invoices

Multi-plate submissions: Order forms have been extended in several ways to further simplify how labs collect sample and project information. A new order form template lets core facilities that manage larger sequencing projects easily receive samples and their information in a multiple plate format. New order fields specific to the plate format are included to support sample tracking and lab work.

Add data to fields: Order forms have been further improved by adding the ability to add new values (or terms) to dropdown fields that already exist on published order forms.


Project field: Additionally, labs can add an optional project field to forms. With these improvements, labs can create forms that are easier to use and modify, as well as enable project tracking for their customers.

Sample location and sample selection: Two new features deliver help for labs that provide sample storage (biobanking) services to their clients. First, order forms can include sample location information. This is particularly useful in situations where samples are delivered in 96-well plates that are stored for later use. Second, samples already stored by the lab as purified DNA, RNA or other material (templates) can be selected from specialized search interfaces within order forms. Like all GSLE sample entry forms, these features can be included or not on a case-by-case basis depending on your specific needs. 

Invoice formatting: For labs that have the dreaded chore of sending billing data to accounting departments, we have added the ability to modify the invoice number format to include additional characters that distinguish which labs are sending information.

Laboratory Operations


GSLE provides the ability to create, list and follow steps in sample protocols (also called workflows). In 3.14, new features not only expand these capabilities but also make it possible to further standardize procedures.


Multiplexing: In Next Generation Sequencing (NGS) several libraries are often combined into a single lane or region of a slide to increase the number of individual samples analyzed in a sequencing run. As each library is prepared, a specific adaptor sequence is added so sequence reads corresponding to different samples can be identified by their adaptor tag. This procedure, called multiplexing or barcoding, is supported in 3.14 and allows the lab to combine samples and adaptor sequences and group the combination of libraries together (Worksets) for sample processing and instrument runs. Once data are collected, sample naming conventions, combined with adaptor sequence (Multiplex Identifier, MID) stored in sample sheets, are used to separate individual reads into files corresponding to the samples that were in the original workset.
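The read-separation step described above can be sketched in a few lines of Python. This is only an illustration of the idea, not how GSLE does it: the sample sheet contents, the MID sequences, and the assumption that each MID sits at the start of the read and has a fixed length are all hypothetical.

```python
from collections import defaultdict

# Hypothetical sample sheet: MID (adaptor tag) sequence -> sample name.
SAMPLE_SHEET = {
    "ACGT": "sample_A",
    "TGCA": "sample_B",
}

def demultiplex(reads, sample_sheet, unassigned="unassigned"):
    """Group reads by the MID found at the start of each read.

    Assumes every MID has the same length and sits at the start of the
    read; the tag is trimmed off before the read is stored.
    """
    mid_length = len(next(iter(sample_sheet)))
    bins = defaultdict(list)
    for read in reads:
        tag, insert = read[:mid_length], read[mid_length:]
        sample = sample_sheet.get(tag)
        if sample is None:
            # Keep the untrimmed read when its tag is not recognized.
            bins[unassigned].append(read)
        else:
            bins[sample].append(insert)
    return dict(bins)

reads = ["ACGTTTGACC", "TGCAGGATCC", "ACGTAAAC", "NNNNACGT"]
by_sample = demultiplex(reads, SAMPLE_SHEET)
```

Here by_sample holds one list of trimmed reads per sample, plus an "unassigned" list for reads whose tag is not in the sample sheet. Real pipelines usually also tolerate sequencing errors in the tag, which this sketch does not attempt.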

Batch data entry: Some lab processes require that samples are manipulated in groups (batches), but laboratory data are collected for individual samples within the batch. For example, the concentrations of individual DNA samples may need to be measured in a 96-well plate. To improve how the OD values, comments, or other information are entered, workflow steps have been updated to include batch data entry forms that provide spreadsheet-like data entry capabilities. Like all GSLE batch data entry forms, data can be entered easily using the form’s column highlight and easy fill controls, or uploaded from an Excel spreadsheet.

Subsample processing: GSLE 3.14 also increases sample processing flexibility. As noted above, order forms can now support the ability to select samples that are already stored in the system. This feature is further extended into the laboratory by creating tools that allow many new samples to be created from a “parent” or stock sample. When new samples (templates) are created, options are provided so that each new sample can be entered into a different process. For example, suppose you receive a tissue sample that needs several experiments performed: RNA-Seq, ChIP-Seq and resequencing. Now you can easily pick the sample and, with just a few clicks, create three new subsamples and define which process will be performed on each.
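One way to picture the parent and subsample relationship is the small Python sketch below. The Sample class, its fields, and the naming scheme for subsamples are hypothetical and are not taken from GSLE; only the three process names come from the example above.

```python
from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class Sample:
    name: str
    process: Optional[str] = None  # workflow the sample is entered into
    # Link back to the stock sample; left out of repr and comparisons
    # so the parent/child cycle stays harmless.
    parent: Optional["Sample"] = field(default=None, repr=False, compare=False)
    children: List["Sample"] = field(default_factory=list)

def create_subsamples(parent, processes):
    """Create one child sample per process and link it to the parent."""
    for number, process in enumerate(processes, start=1):
        child = Sample(name=f"{parent.name}.{number}", process=process, parent=parent)
        parent.children.append(child)
    return parent.children

tissue = Sample(name="tissue_001")
subsamples = create_subsamples(tissue, ["RNA-Seq", "ChIP-Seq", "resequencing"])
```

Each subsample keeps a reference to its stock sample, so its lineage can still be traced after it enters its own process.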

Selecting samples based on custom data: Some labs need to use custom data entered into order forms to sort and filter samples in the lab. For example, an order form may ask a researcher to enter read lengths for their NGS run. A 36-base run is much faster than a 100-base run, and on some platforms costs less. Thus, the lab will sort samples based on read length prior to the data collection event. While it has always been possible to get this information in many GSLE displays, 3.14 adds new capabilities to use any custom data in its specialized sample picker tools.
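The read-length example amounts to sorting and grouping samples on a custom field. A minimal Python sketch follows, assuming each sample is a simple record with a hypothetical "read_length" field; the sample names and values are invented for illustration.

```python
from itertools import groupby
from operator import itemgetter

# Hypothetical samples with a custom "read_length" order-form field.
samples = [
    {"name": "S1", "read_length": 100},
    {"name": "S2", "read_length": 36},
    {"name": "S3", "read_length": 36},
    {"name": "S4", "read_length": 100},
]

def group_by_field(samples, field_name):
    """Sort samples on a custom field and group those sharing a value."""
    key = itemgetter(field_name)
    ordered = sorted(samples, key=key)  # groupby needs sorted input
    return {value: [s["name"] for s in group]
            for value, group in groupby(ordered, key=key)}

runs = group_by_field(samples, "read_length")
```

The result maps each read length to the samples that requested it, so the 36-base samples can be loaded onto one run and the 100-base samples onto another.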

Other Features

Customer data management: GSLE v3.14 gives labs’ customers increased ability to organize their chromatograms, fragment analysis files and microarray files as needed. Data files can be edited, relabeled, moved or deleted. Projects and folders can be created, modified or deleted to aid in data organization.

Application Programming Interface (Onsite Installations Only)

SQL-API: As automation and system integration needs increase, requirements for supporting programmatic data entry become more important. GSLE has continued to expand the self-documenting Application Programming Interface (API). We have also added an SQL API that can be used to create custom reports that are accessed via a wget-style Unix command.


Input API enhancements: The Input API now returns success IDs, and CGI parameter names have been eliminated. To review the full documentation, contact support@geospiza.com for the GSLE SQL API Manual or the GSLE Input API Manual.


Next Generation Analysis Transfer Tool (Hosted Partners Only)

Simplified data transfers: A data transfer interface has been added to connect GSLE and GeneSifter Analysis Edition (GSAE). Partner Program administrators use the interface to select data files in GSLE and transfer them to their customer’s account in GSAE.

Schema Table update note


An existing schema table was updated: the column "Plate_Label" is now in table om_sample_plate instead of om_order.
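For anyone maintaining custom reports, the practical effect is that queries reading Plate_Label from om_order need to read it from om_sample_plate instead. The Python sketch below illustrates this with an in-memory SQLite database as a stand-in. Only the two table names and the Plate_Label column come from this note; the other columns, the join key, and the sample rows are assumptions made for illustration, so check the actual schema before adapting it.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Only the table names and Plate_Label come from the release note;
    -- every other column here is a hypothetical stand-in.
    CREATE TABLE om_order (order_id INTEGER PRIMARY KEY, customer TEXT);
    CREATE TABLE om_sample_plate (
        plate_id INTEGER PRIMARY KEY,
        order_id INTEGER REFERENCES om_order(order_id),
        Plate_Label TEXT
    );
    INSERT INTO om_order VALUES (1, 'Lab A');
    INSERT INTO om_sample_plate VALUES (10, 1, 'Plate-001'), (11, 1, 'Plate-002');
""")

# A query written against the old layout no longer finds the column.
try:
    conn.execute("SELECT Plate_Label FROM om_order")
    old_query_works = True
except sqlite3.OperationalError:
    old_query_works = False

# The updated query reads the label from om_sample_plate instead.
labels = [row[0] for row in conn.execute(
    "SELECT p.Plate_Label FROM om_sample_plate AS p "
    "JOIN om_order AS o ON o.order_id = p.order_id "
    "WHERE o.order_id = ? ORDER BY p.plate_id", (1,))]
```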


Wednesday, February 17, 2010

Standardizing the Next Generation of Bioinformatics Software Development With BioHDF (HDF5)

AGBT is next week, and we'll be there presenting a poster on our latest and greatest work with HDF5 and BioHDF tools. For those of you attending, check out the poster. For those unable to attend, check back later for the "Bloginar."

Abstract

Next Generation Sequencing technologies are powerful tools for rapidly sequencing genomes and studying functional genomics. However, the lack of scalable data analysis capabilities limits their potential. Future bioinformatics applications need to be developed on common standard infrastructures that are self-describing and that can reduce overall data storage, increase data processing performance, and integrate information from multiple sources. HDF technologies meet all of these requirements, have a long history, and are widely used in data-intensive science communities. They consist of general data file formats, software libraries and tools for manipulating the data. Compared to emerging standards such as the SAM/BAM formats, HDF5-based systems demonstrate improved I/O performance and methods to reduce data storage. HDF5 is also more extensible and can support multiple data indexes and store multiple data types. For these reasons, HDF5 and its BioHDF implementation are well qualified as standards for implementing data models in binary formats to support the next generation of bioinformatics applications.

In the poster we will present:
  1. An overview of NGS data analysis and workflows
  2. A prototype data model for working with NGS data
  3. Practical examples of data analysis and viewing information using the underlying framework
  4. Performance benchmarks comparing HDF5 to other file formats 


Wednesday, February 3, 2010

Sneak Peek: Data Analysis Methods for Whole Transcriptome Sequencing Applications – Challenges and Solutions

RNA sequencing is one of the most popular Next Generation Sequencing (NGS) applications. Next Thursday, February 11 at 10:00 A.M. PST (1:00 P.M. EST), we kick off our 2010 webinar series with a presentation designed to help you understand whole transcriptome data analysis and what can be learned in these experiments. In addition, we will show off some of our latest tools and interfaces that can be used to discover new RNAs, new splice forms of transcripts, and alleles of expressed genes.

Summary

RNA sequencing applications such as Whole Transcriptome Analysis, Tag Profiling and Small RNA Analysis allow whole genome analysis of coding as well as non-coding RNA at an unprecedented level. Current technologies allow for the generation of 500 million data points in a single instrument run. In addition to allowing for the complete characterization of all known RNAs in a sample (gene level expression summaries, exon usage, splice junctions, single nucleotide variants, insertions and deletions), these applications are also ideal for the identification of novel RNAs as well as novel splicing events.

This presentation will provide an overview of Whole Transcriptome data analysis workflows with emphasis on calculating gene and exon level expression values as well as identifying splice junctions and variants from short read data. Comparisons of multiple groups to identify differential gene expression as well as differential splicing will also be discussed. Using data drawn from the GEO data repository and Short Read Archive (SRA), analysis examples will be presented for both Illumina’s GA and Lifetech’s SOLiD instruments.

Register Today!


Sunday, January 31, 2010

Next Generation Sequencing: Deep and Fast!

The Next Generation Sequencing field is fast moving. A simple NGS search in PubMed already yields 51 articles for 2010. At this rate there will be over 600 papers this year that use NGS, review NGS progress, or introduce new NGS analysis methods.
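The projection is a simple linear extrapolation, sketched below in Python. The 51 articles come from the search above; treating them as one month (or 31 days) of output is an assumption based on this post's date, and PubMed entries dated ahead of print make the estimate rough.

```python
articles_so_far = 51   # PubMed hits for 2010, from the search above
days_elapsed = 31      # assumption: the post is dated January 31

# Two rough ways to extend one month of results to a full year.
projected_by_month = articles_so_far * 12
projected_by_day = articles_so_far / days_elapsed * 365

print(projected_by_month, round(projected_by_day))
```

Either way the estimate lands at roughly 600 papers or more for the year.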

The references retrieved from the search are listed below.

1: Marguerat S, Bahler J. RNA-seq: from technology to biology. Cell Mol Life Sci. 2010 Feb;67(4):569-79. Epub 2009 Oct 27. Review. PubMed PMID: 19859660.

2: Mamanova L, Coffey AJ, Scott CE, Kozarewa I, Turner EH, Kumar A, Howard E, Shendure J, Turner DJ. Target-enrichment strategies for next-generation sequencing. Nat Methods. 2010 Feb;7(2):111-8. PubMed PMID: 20111037.

3: Jex AR, Hall RS, Littlewood DT, Gasser RB. An integrated pipeline for next-generation sequencing and annotation of mitochondrial genomes. Nucleic Acids Res. 2010 Feb;38(2):522-33. Epub 2009 Nov 5. PubMed PMID: 19892826; PubMed Central PMCID: PMC2811008.

4: Quinlan AR, Hall IM. BEDTools: A flexible suite of utilities for comparing genomic features. Bioinformatics. 2010 Jan 28. [Epub ahead of print] PubMed PMID: 20110278.

5: Gottlieb B, Beitel LK, Alvarado C, Trifiro MA. Selection and mutation in the "new" genetics: an emerging hypothesis. Hum Genet. 2010 Jan 23. [Epub ahead of print] PubMed PMID: 20099069.

6: Monger WA, Alicai T, Ndunguru J, Kinyua ZM, Potts M, Reeder RH, Miano DW, Adams IP, Boonham N, Glover RH, Smith J. The complete genome sequence of the Tanzanian strain of Cassava brown streak virus and comparison with the Ugandan strain sequence. Arch Virol. 2010 Jan 22. [Epub ahead of print] PubMed PMID: 20094895.

7: Popp C, Dean W, Feng S, Cokus SJ, Andrews S, Pellegrini M, Jacobsen SE, Reik W. Genome-wide erasure of DNA methylation in mouse primordial germ cells is affected by AID deficiency. Nature. 2010 Jan 22. [Epub ahead of print] PubMed PMID: 20098412.

8: Li R, Fan W, Tian G, Zhu H, He L, Cai J, Huang Q, Cai Q, Li B, Bai Y, Zhang Z, Zhang Y, Wang W, Li J, Wei F, Li H, Jian M, Li J, Zhang Z, Nielsen R, Li D, Gu W, Yang Z, Xuan Z, Ryder OA, Leung FC, Zhou Y, Cao J, Sun X, Fu Y, Fang X, Guo X, Wang B, Hou R, Shen F, Mu B, Ni P, Lin R, Qian W, Wang G, Yu C, Nie W, Wang J, Wu Z, Liang H, Min J, Wu Q, Cheng S, Ruan J, Wang M, Shi Z, Wen M, Liu B, Ren X, Zheng H, Dong D, Cook K, Shan G, Zhang H, Kosiol C, Xie X, Lu Z, Zheng H, Li Y, Steiner CC, Lam TT, Lin S, Zhang Q, Li G, Tian J, Gong T, Liu H, Zhang D, Fang L, Ye C, Zhang J, Hu W, Xu A, Ren Y, Zhang G, Bruford MW, Li Q, Ma L, Guo Y, An N, Hu Y, Zheng Y, Shi Y, Li Z, Liu Q, Chen Y, Zhao J, Qu N, Zhao S, Tian F, Wang X, Wang H, Xu L, Liu X, Vinar T, Wang Y, Lam TW, Yiu SM, Liu S, Zhang H, Li D, Huang Y, Wang X, Yang G, Jiang Z, Wang J, Qin N, Li L, Li J, Bolund L, Kristiansen K, Wong GK, Olson M, Zhang X, Li S, Yang H, Wang J, Wang J. The sequence and de novo assembly of the giant panda genome. Nature. 2010 Jan 21;463(7279):311-7. Epub 2009 Dec 13. PubMed PMID: 20010809.

9: Falgueras J, Lara AJ, Fernandez-Pozo N, Canton FR, Perez-Trabado G, Claros MG. SeqTrim: a high-throughput pipeline for preprocessing any type of sequence reads. BMC Bioinformatics. 2010 Jan 20;11(1):38. [Epub ahead of print] PubMed PMID: 20089148.

10: Beck AH, Weng Z, Witten DM, Zhu S, Foley JW, Lacroute P, Smith CL, Tibshirani R, van de Rijn M, Sidow A, West RB. 3'-end sequencing for expression quantification (3SEQ) from archival tumor samples. PLoS One. 2010 Jan 19;5(1):e8768. PubMed PMID: 20098735; PubMed Central PMCID: PMC2808244.

11: Webb KM, Rosenthal BM. Deep resequencing of Trichinella spiralis reveals previously un-described single nucleotide polymorphisms and intra-isolate variation within the mitochondrial genome. Infect Genet Evol. 2010 Jan 18. [Epub ahead of print] PubMed PMID: 20083232.

12: Galvez S, Diaz D, Hernandez P, Esteban FJ, Caballero JA, Dorado G. Next-Generation Bioinformatics: Using Many-Core Processor Architecture to Develop a Web Service for Sequence Alignment. Bioinformatics. 2010 Jan 16. [Epub ahead of print] PubMed PMID: 20081221.

13: Gilchrist E, Haughn G. Reverse genetics techniques: engineering loss and gain of gene function in plants. Brief Funct Genomic Proteomic. 2010 Jan 16. [Epub ahead of print] PubMed PMID: 20081218.

14: Lavoie PM, Dube MP. Genetics of bronchopulmonary dysplasia in the age of genomics. Curr Opin Pediatr. 2010 Jan 16. [Epub ahead of print] PubMed PMID: 20087186.

15: Hyten DL, Cannon SB, Song Q, Weeks N, Fickus EW, Shoemaker RC, Specht JE, Farmer AD, May GD, Cregan PB. High-throughput SNP discovery through deep resequencing of a reduced representation library to anchor and orient scaffolds in the soybean whole genome sequence. BMC Genomics. 2010 Jan 15;11(1):38. [Epub ahead of print] PubMed PMID: 20078886.

16: Pleasance ED, Stephens PJ, O'Meara S, McBride DJ, Meynert A, Jones D, Lin ML, Beare D, Lau KW, Greenman C, Varela I, Nik-Zainal S, Davies HR, Ordonez GR, Mudie LJ, Latimer C, Edkins S, Stebbings L, Chen L, Jia M, Leroy C, Marshall J, Menzies A, Butler A, Teague JW, Mangion J, Sun YA, McLaughlin SF, Peckham HE, Tsung EF, Costa GL, Lee CC, Minna JD, Gazdar A, Birney E, Rhodes MD, McKernan KJ, Stratton MR, Futreal PA, Campbell PJ. A small-cell lung cancer genome with complex signatures of tobacco exposure. Nature. 2010 Jan 14;463(7278):184-90. Epub 2009 Dec 16. PubMed PMID: 20016488.

17: Myles S, Chia JM, Hurwitz B, Simon C, Zhong GY, Buckler E, Ware D. Rapid genomic characterization of the genus vitis. PLoS One. 2010 Jan 13;5(1):e8219. PubMed PMID: 20084295; PubMed Central PMCID: PMC2805708.

18: Santuari L, Pradervand S, Amiguet-Vercher AM, Thomas J, Dorcey E, Harshman K, Xenarios I, Juenger TE, Hardtke CS. Substantial deletion overlap among divergent Arabidopsis genomes revealed by intersection of short reads and tiling arrays. Genome Biol. 2010 Jan 12;11(1):R4. [Epub ahead of print] PubMed PMID: 20067627.

19: Pool J, Hellmann I, Jensen J, Nielsen R. Population genetics of genome-scale sequence variation. Genome Res. 2010 Jan 12. [Epub ahead of print] PubMed PMID: 20067940.

20: Duncan EL, Brown MA. Mapping genes for osteoporosis-Old dogs and new tricks. Bone. 2010 Jan 11. [Epub ahead of print] PubMed PMID: 20060943.

21: Byrne SL, Durandeau K, Nagy I, Barth S. Identification of ABC transporters from Lolium perenne L. that are regulated by toxic levels of selenium. Planta. 2010 Jan 9. [Epub ahead of print] PubMed PMID: 20063009.

22: Zhao CZ, Xia H, Frazier TP, Yao YY, Bi YP, Li AQ, Li MJ, Li CS, Zhang BH, Wang XJ. Deep sequencing identifies novel and conserved microRNAs in peanut (Arachis hypogaea L.). BMC Plant Biol. 2010 Jan 5;10(1):3. [Epub ahead of print] PubMed PMID: 20047695.

23: Medina M, Sachs JL. Symbiont genomics, our new tangled bank. Genomics. 2010 Jan 4. [Epub ahead of print] PubMed PMID: 20053372.

24: Hittinger CT, Johnston M, Tossberg JT, Rokas A. Leveraging skewed transcript abundance by RNA-Seq to increase the genomic depth of the tree of life. Proc Natl Acad Sci U S A. 2010 Jan 4. [Epub ahead of print] PubMed PMID: 20080632.

25: Volpi L, Roversi G, Colombo EA, Leijsten N, Concolino D, Calabria A, Mencarelli MA, Fimiani M, Macciardi F, Pfundt R, Schoenmakers EF, Larizza L. Targeted next-generation sequencing appoints c16orf57 as clericuzio-type poikiloderma with neutropenia gene. Am J Hum Genet. 2010 Jan;86(1):72-6. Epub 2009 Dec 10. PubMed PMID: 20004881.

26: Stankiewicz P, Lupski JR. Structural variation in the human genome and its role in disease. Annu Rev Med. 2010;61:437-55. PubMed PMID: 20059347.

27: Clement NL, Snell Q, Clement MJ, Hollenhorst PC, Purwar J, Graves BJ, Cairns BR, Johnson WE. The GNUMAP algorithm: unbiased probabilistic mapping of oligonucleotides from next-generation sequencing. Bioinformatics. 2010 Jan 1;26(1):38-45. Epub 2009 Oct 27. PubMed PMID: 19861355.

28: Arner E, Hayashizaki Y, Daub CO. NGSView: an extensible open source editor for next-generation sequencing data. Bioinformatics. 2010 Jan 1;26(1):125-6. Epub 2009 Oct 24. PubMed PMID: 19855106; PubMed Central PMCID: PMC2796816.

29: Jex AR, Littlewood DT, Gasser RB. Toward next-generation sequencing of mitochondrial genomes--focus on parasitic worms of animals and biotechnological implications. Biotechnol Adv. 2010 Jan-Feb;28(1):151-9. Epub . Review. PubMed PMID: 19913084.

30: Jex AR, Gasser RB. Genetic richness and diversity in Cryptosporidium hominis and C. parvum reveals major knowledge gaps and a need for the application of "next generation" technologies--research review. Biotechnol Adv. 2010 Jan-Feb;28(1):17-26. Epub . Review. PubMed PMID: 19699288.

31: Chou LS, Liu CS, Boese B, Zhang X, Mao R. DNA sequence capture and enrichment by microarray followed by next-generation sequencing for targeted resequencing: neurofibromatosis type 1 gene as a model. Clin Chem. 2010 Jan;56(1):62-72. Epub 2009 Nov 12. PubMed PMID: 19910506.

32: Nagalakshmi U, Waern K, Snyder M. RNA-Seq: a method for comprehensive transcriptome analysis. Curr Protoc Mol Biol. 2010 Jan;Chapter 4:Unit 4.11.1-13. PubMed PMID: 20069539.

33: Fullwood MJ, Han Y, Wei CL, Ruan X, Ruan Y. Chromatin interaction analysis using paired-end tag sequencing. Curr Protoc Mol Biol. 2010 Jan;Chapter 21:Unit 21.15.1-25. PubMed PMID: 20069536.

34: Roukos DH. Novel clinico-genome network modeling for revolutionizing genotype-phenotype-based personalized cancer care. Expert Rev Mol Diagn. 2010 Jan;10(1):33-48. PubMed PMID: 20014921.

35: Liu S, Chen HD, Makarevitch I, Shirmer R, Emrich SJ, Dietrich CR, Barbazuk WB, Springer NM, Schnable PS. High-throughput genetic mapping of mutants via quantitative single nucleotide polymorphism typing. Genetics. 2010 Jan;184(1):19-26. Epub 2009 Nov 2. PubMed PMID: 19884313.

36: Day IN. dbSNP in the detail and copy number complexities. Hum Mutat. 2010 Jan;31(1):2-4. PubMed PMID: 20024941.

37: Hamady M, Lozupone C, Knight R. Fast UniFrac: facilitating high-throughput phylogenetic analyses of microbial communities including analysis of pyrosequencing and PhyloChip data. ISME J. 2010 Jan;4(1):17-27. Epub 2009 Aug 27. PubMed PMID: 19710709; PubMed Central PMCID: PMC2797552.

38: Aparicio SA, Huntsman DG. Does massively parallel DNA resequencing signify the end of histopathology as we know it? J Pathol. 2010 Jan;220(2):307-15. PubMed PMID: 19921711.

39: Bell DW. Our changing view of the genomic landscape of cancer. J Pathol. 2010 Jan;220(2):231-43. PubMed PMID: 19918804.

40: Nobuta K, McCormick K, Nakano M, Meyers BC. Bioinformatics analysis of small RNAs in plants using next generation sequencing technologies. Methods Mol Biol. 2010;592:89-106. PubMed PMID: 19802591.

41: Salmon A, Ainouche ML. Polyploidy and DNA methylation: new tools available. Mol Ecol. 2010 Jan;19(2):213-5. PubMed PMID: 20078770.

42: Gathering clouds and a sequencing storm: why cloud computing could broaden community access to next-generation sequencing. Nat Biotechnol. 2010 Jan;28(1):1. PubMed PMID: 20062015.

43: Metzker ML. Sequencing technologies - the next generation. Nat Rev Genet. 2010 Jan;11(1):31-46. Epub 2009 Dec 8. Review. PubMed PMID: 19997069.

44: Northcott PA, Rutka JT, Taylor MD. Genomics of medulloblastoma: from Giemsa-banding to next-generation sequencing in 20 years. Neurosurg Focus. 2010 Jan;28(1):E6. PubMed PMID: 20043721.

45: Yang JH, Shao P, Zhou H, Chen YQ, Qu LH. deepBase: a database for deeply annotating and mining deep sequencing data. Nucleic Acids Res. 2010 Jan;38(Database issue):D123-30. Epub 2009 Dec 4. PubMed PMID: 19966272; PubMed Central PMCID: PMC2808990.

46: Shumway M, Cochrane G, Sugawara H. Archiving next generation sequencing data. Nucleic Acids Res. 2010 Jan;38(Database issue):D870-1. Epub 2009 Dec 3. PubMed PMID: 19965774; PubMed Central PMCID: PMC2808927.

47: Brooksbank C, Cameron G, Thornton J. The European Bioinformatics Institute's data resources. Nucleic Acids Res. 2010 Jan;38(Database issue):D17-25. Epub 2009 Nov 24. PubMed PMID: 19934258; PubMed Central PMCID: PMC2808956.

48: Kim P, Yoon S, Kim N, Lee S, Ko M, Lee H, Kang H, Kim J, Lee S. ChimerDB 2.0--a knowledgebase for fusion genes updated. Nucleic Acids Res. 2010 Jan;38(Database issue):D81-5. Epub 2009 Nov 11. PubMed PMID: 19906715; PubMed Central PMCID: PMC2808913.

49: Leinonen R, Akhtar R, Birney E, Bonfield J, Bower L, Corbett M, Cheng Y, Demiralp F, Faruque N, Goodgame N, Gibson R, Hoad G, Hunter C, Jang M, Leonard S, Lin Q, Lopez R, Maguire M, McWilliam H, Plaister S, Radhakrishnan R, Sobhany S, Slater G, Ten Hoopen P, Valentin F, Vaughan R, Zalunin V, Zerbino D, Cochrane G. Improvements to services at the European Nucleotide Archive. Nucleic Acids Res. 2010 Jan;38(Database issue):D39-45. Epub 2009 Nov 11. PubMed PMID: 19906712; PubMed Central PMCID: PMC2808951.

50: Kaminuma E, Mashima J, Kodama Y, Gojobori T, Ogasawara O, Okubo K, Takagi T, Nakamura Y. DDBJ launches a new archive database with analytical tools for next-generation sequence data. Nucleic Acids Res. 2010 Jan;38(Database issue):D33-8. Epub 2009 Oct 22. PubMed PMID: 19850725; PubMed Central PMCID: PMC2808917.

51: Jing H. [Advances in approaches for the quantitative detection of microRNAs.]. Yi Chuan. 2010 Jan;32(1):31-40. Chinese. PubMed PMID: 20085883.

Monday, January 25, 2010

Grant Opportunities for Next Generation DNA Sequencing

As we close the first month of 2010, it is time to get your pencils sharpened and submit proposals for new shared instruments.

The National Center for Research Resources (NCRR) announced that it has $43M to fund equipment purchases in 2011. With this money, NCRR expects to make approximately 125 new awards for instruments that cost at least $100,000 but less than $600,000. NCRR proposals are due March 23, 2010.

In addition to NCRR, the National Science Foundation (NSF), through its Major Research Instrumentation (MRI) program, has $90M to make 150 awards of between $100,000 and $4M for shared instrumentation. MRI proposals are due April 21, 2010.

Remember, when preparing proposals, a sound informatics plan will make your application stand out. Contact us if you’d like more information.


Monday, January 18, 2010

Systems Biology with HDF5

As many are aware, Geospiza and The HDF Group are collaborating to extend HDF (Hierarchical Data Format) technologies to support the data management needs of high performance computing applications in genomics. As we do this work, others are also adopting HDF5 as a data storage technology to work with different kinds of biological data.

The Association for Computing Machinery (ACM) recently published an article, "Unifying Biological Image Formats with HDF5," that argues for using HDF5 and HDF tools as a common framework for working with image files. This article is worth reading for several reasons.

First, it provides a nice introduction and background to HDF5, its origins, and its movement toward becoming an ISO standard. HDF5's technical features are also included in this discussion.

Next, a brief history of the imaging community is covered to share how X-ray crystallographers and electron and optical microscopists had all independently considered HDF5 as a framework for their next-generation image file formats. Through this discussion, the challenges that have been identified within the imaging community are listed.

As in genomics, the amounts of data being collected are ever increasing, current formats are inflexible and difficult to adapt to future modalities and dimensionality, and the nonarchival quality of data undermines long-term value. That is, current data typically lack sufficient metadata about their origins and experiments to be useful in the long term.

The article goes on to make the point that current challenges with image data could be addressed if the community adopts an existing format that can support both generic and specialized data formats and meet a set of common requirements related to performance, interoperability, and archiving. Examples of how HDF5 meets these requirements are included. Briefly, HDF5's data caching can be used to overcome computation bottlenecks related to the fact that image sizes are exceeding RAM capacity. Interoperability issues can be addressed through HDF5's ability to store multiple metadata schemas in flexible ways. And, because HDF5 is self-describing, data stored in HDF5 can be better preserved.

Finally, a barrier to moving to a new technology is supporting legacy applications that may be costly to replace. Thus, the article closes with a creative proposal for supporting legacy software applications and recommendations for future development. HDF5 files could support legacy software applications if they were able to present the data, stored within the HDF5 file, as the collection of directories and files required by the legacy application. This could be accomplished by developing an abstraction layer that could interact with FUSE (Filesystem in User Space) and essentially mount the HDF5 file as a virtual file system. Such a scenario is only possible because data are stored in HDF5 in a general way that can be further abstracted and presented in multiple specific ways.
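To make the virtual file system idea concrete, here is a minimal Python sketch of the kind of mapping such an abstraction layer would perform. It uses neither HDF5 nor FUSE: a nested dictionary stands in for the HDF5 hierarchy of groups and datasets, and the two functions answer the directory-listing and file-read requests a legacy application would make. All names and contents are hypothetical, and this is not the design proposed in the article.

```python
# Stand-in for an HDF5 file: dicts play the role of groups and bytes
# objects play the role of datasets. All names here are hypothetical.
CONTAINER = {
    "images": {
        "frame_0001.tif": b"\x00\x01",
        "frame_0002.tif": b"\x02\x03",
    },
    "metadata.txt": b"instrument=example\n",
}

def _resolve(container, path):
    """Walk a '/'-separated virtual path down the hierarchy."""
    node = container
    for part in filter(None, path.split("/")):
        if not isinstance(node, dict) or part not in node:
            raise FileNotFoundError(path)
        node = node[part]
    return node

def list_dir(container, path="/"):
    """Answer a directory listing for a group."""
    node = _resolve(container, path)
    if not isinstance(node, dict):
        raise NotADirectoryError(path)
    return sorted(node)

def read_file(container, path):
    """Return a dataset's bytes, as a file read would."""
    node = _resolve(container, path)
    if isinstance(node, dict):
        raise IsADirectoryError(path)
    return node
```

For example, list_dir(CONTAINER, "/images") lists the two frame files and read_file(CONTAINER, "/metadata.txt") returns the metadata bytes, so a legacy program sees ordinary directories and files even though everything lives in one container.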

While this article focused on issues related to image formats, there are many parallels that the genomics and Next Generation Sequencing communities should pay attention to. If you are a bioinformatics software developer or are running bioinformatics projects, you should put this paper on your must-read list.
