Saturday, September 12, 2009

Sneak Peek: Sequencing the Transcriptome: RNA Applications for Next Generation Sequencing

Join us this coming Wednesday, September 16, 2009, at 10:00 am Pacific Daylight Time (San Francisco, GMT-07:00), for a webinar on whole transcriptome analysis. In the presentation you will learn how GeneSifter Analysis Edition can be used to identify novel RNAs and novel splice events within known RNAs.

Abstract:

Next Generation Sequencing applications such as RNA-Seq, Tag Profiling, Whole Transcriptome Sequencing and Small RNA Analysis allow whole genome analysis of coding as well as non-coding RNA at an unprecedented level. Current technologies allow for the generation of 200 million data points in a single instrument run. In addition to allowing for the complete characterization of all known RNAs in a sample, these applications are also ideal for the identification of novel RNAs and novel splicing events for known RNAs.

This presentation will provide an overview of the RNA applications using data from the NCBI's GEO database and Short Read Archive with an emphasis on converting raw data into biologically meaningful datasets. Data analysis examples will focus on methods for identifying differentially expressed genes, novel genes, differential splicing and 5’ and 3’ variation in miRNAs.

To register, please visit the event page.


Tuesday, May 12, 2009

Small RNAs Get Smaller

Tiny RNAs recently joined the growing list of non-coding RNA (ncRNA) molecules [1]. Their exact function is not understood, but they may be a new class of ncRNA and appear to be most associated with the transcription of highly expressed genes in humans, chickens, Drosophila, and possibly other animals.

This was the conclusion of work published in the May issue of Nature Genetics. Remember when all we had to worry about was the central dogma? DNA was transcribed into RNA and RNA was translated into protein. Life was so simple.

Not really. Even as the first genetic code was being elucidated [2], the possibility of uncovering a second code, one that translates nucleic acid sequences into protein sequences, was being contemplated [3]. Translating RNA into protein required other kinds of RNA that became known as ribosomal (rRNA) and transfer RNA (tRNA). The RNA between DNA and protein became messenger (m)RNA. In the late seventies, introns were discovered [4,5], and small nuclear (sn)RNAs and “snurps” (small nuclear ribonucleoproteins) soon followed. The snRNAs were further characterized as small nucleolar (snoRNA) and Cajal body-specific (scaRNA) RNAs, and a new class of molecules was investigated for its involvement in mRNA splicing.

As the mechanisms for splicing were being worked out, researchers were able to prove that RNA could also be an enzyme [6]. In this case, the intron is the enzyme responsible for splicing itself out to create the mature mRNA. At the same time, another group discovered that the catalytic unit of RNase P, an enzyme involved in converting precursor tRNAs into active tRNAs, is also RNA [7]. Indeed, later work revealed that rRNA in the large ribosome subunit catalyzes the peptidyl transferase reaction to join amino acids together to build proteins [8]. Not only does the central dogma require a multitude of RNA molecules to transcribe DNA into RNA and translate RNA into protein, but the RNA molecules are responsible for carrying the information needed to make proteins and supplying the enzymatic activity to do the work!

What else does RNA do?

More than we can imagine. Starting with the discovery that double-stranded RNA (dsRNA) could inhibit gene expression by turning on RNA interference (RNAi) pathways [9], two new RNAs, micro (miRNA) and small interfering (siRNA), were identified as essential to the RNAi pathway. miRNA and siRNA were the early members of what would become a large and growing class of RNAs now referred to as non-coding RNAs (ncRNAs).

The ncRNAs represent the next frontier in RNA research and in understanding gene expression. Some ncRNAs are large, like lincRNAs (large intervening non-coding RNAs) [10], but most are small, between 18 and 31 nt. Within the small ncRNA group are piwi-interacting (piRNA), repeat associated small interfering (rasiRNA), small temporal (stRNA), and now transcription initiation (tiRNA) RNA. I like tiny RNA.

Tiny, or tiRNAs, were discovered by Next Generation Sequencing (NGS) studies. RNA libraries were prepared from specific size fractions of capped messages. The resulting libraries were sequenced on the Roche FLX Genome Sequencing system and the data were aligned to human genome build 36.1 and compared to transcription start sites (TSS) defined by RefGene (NCBI). The authors reasoned that previous deep-sequencing studies missed these RNA molecules because they tend to be disregarded as low-abundance, spurious, or degradation products. However, because they can be cloned, they must have a 5’ phosphate and, when aligned to genomic sequences, the NGS reads cluster in a non-random fashion around TSSs.
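To make the clustering argument concrete, here is a minimal Python sketch. It is not the authors' pipeline: the coordinates are made-up numbers, `nearest_tss_offset` is a hypothetical helper, and strand is ignored for simplicity. For each aligned read it computes the signed distance to the closest TSS; reads that are not random background should pile up in a narrow window around zero.

```python
from bisect import bisect_left

def nearest_tss_offset(pos, tss_sorted):
    """Return the signed distance (read position minus TSS) to the closest TSS.

    tss_sorted must be a non-empty, ascending list of TSS coordinates
    on the same chromosome as the read.
    """
    i = bisect_left(tss_sorted, pos)
    # Only the TSS just before and just after the read can be the closest.
    candidates = tss_sorted[max(i - 1, 0):i + 1]
    best = min(candidates, key=lambda t: abs(pos - t))
    return pos - best

# Hypothetical data: TSS coordinates and read 5' ends on one chromosome.
tss = sorted([1000, 5000, 9000])
read_starts = [1012, 1015, 1018, 5010, 5020, 7000, 9016]

offsets = [nearest_tss_offset(p, tss) for p in read_starts]
near = [o for o in offsets if abs(o) <= 50]  # 50 nt window is an arbitrary choice

print(offsets)
print(f"{len(near)} of {len(offsets)} reads fall within 50 nt of a TSS")
```

A real analysis would also account for strand and compare the observed distribution against a random background before calling it non-random.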

GeneSifter enables small RNA research

NGS makes it possible to explore the RNA world in new ways by designing experiments to capture small RNA molecules and sequence them in a massively parallel, high throughput format. However, both the experiments and data analysis are technically challenging. Fortunately, GeneSifter Laboratory Edition (GSLE) and GeneSifter Analysis Edition (GSAE) can help. In GSLE you can use the software to track RNA preparation steps and record data at different points of the process. GSAE comes with data analysis pipelines designed to filter artifacts and identify known small RNAs. Post-alignment clustering reports, based on coverage in a genome, can be used to further refine results and discover new RNA species as well. Moreover, you can convert the clustering reports into lists of expression values for these RNAs and compare their expression between different samples, tissues, or experimental conditions.
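The idea behind coverage-based clustering can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not how GSAE implements its reports: `cluster_reads` is a hypothetical function, the intervals are invented, and a single chromosome and strand are assumed. Overlapping alignments are merged into clusters, and the number of reads in each cluster serves as a simple expression value.

```python
def cluster_reads(alignments, max_gap=0):
    """Merge overlapping or touching alignments into clusters.

    alignments: iterable of (start, end) half-open intervals on one chromosome.
    Returns a list of [start, end, read_count] clusters in genomic order.
    """
    clusters = []
    for start, end in sorted(alignments):
        if clusters and start <= clusters[-1][1] + max_gap:
            # Read overlaps the current cluster: extend it and count the read.
            clusters[-1][1] = max(clusters[-1][1], end)
            clusters[-1][2] += 1
        else:
            clusters.append([start, end, 1])
    return clusters

# Hypothetical alignments of small RNA reads.
reads = [(100, 122), (101, 123), (105, 126), (400, 421), (402, 424), (900, 922)]
clusters = cluster_reads(reads)

# Keep clusters supported by at least two reads; singletons are more likely noise.
supported = [c for c in clusters if c[2] >= 2]
for start, end, count in supported:
    print(f"cluster {start}-{end}: {count} reads")
```

The read counts per cluster from several samples can then be lined up by cluster coordinates and compared like any other expression table.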


References
1. Taft R.J., Glazov E.A., Cloonan N., et al., 2009. Tiny RNAs associated with transcription start sites in animals. Nat Genet 41, 572-578.

2. Watson J.D., Crick F.H.C., 1953. Molecular structure of nucleic acids: a structure for deoxyribose nucleic acid. Nature 171, 737-738.

3. Crick F.H., Barnett L., Brenner S., Watts-Tobin R.J., 1961. General nature of the genetic code for proteins. Nature 192, 1227-1232.

4. Chow L.T., Roberts J.M., Lewis J.B., Broker T.R., 1977. A map of cytoplasmic RNA transcripts from lytic adenovirus type 2, determined by electron microscopy of RNA:DNA hybrids. Cell 11, 819-836.

5. Berk A.J., Sharp P.A., 1977. Sizing and mapping of early adenovirus mRNAs by gel electrophoresis of S1 endonuclease-digested hybrids. Cell 12, 721-732.

6. Zaug A.J., Cech T.R., 1982. The intervening sequence excised from the ribosomal RNA precursor of Tetrahymena contains a 5’-terminal guanosine residue not encoded by the DNA. Nucleic Acids Res 10, 2823-2838.

7. Guerrier-Takada C., Gardiner K., Marsh T., Pace N., Altman S., 1983. The RNA moiety of ribonuclease P is the catalytic subunit of the enzyme. Cell 35, 849-857.

8. Nissen P., Hansen J., Ban N., Moore P.B., Steitz T.A., 2000. The structural basis of ribosome activity in peptide bond synthesis. Science 289, 920-930.

9. Fire A., Xu S., Montgomery M.K., Kostas S.A., Driver S.E., Mello C.C., 1998. Potent and specific genetic interference by double-stranded RNA in Caenorhabditis elegans. Nature 391, 806-811.

10. Guttman M., Amit I., Garber M., French C., Lin M.F., Feldser D., Huarte M., Zuk O., Carey B.W., Cassady J.P., Cabili M.N., Jaenisch R., Mikkelsen T.S., Jacks T., Hacohen N., Bernstein B.E., Kellis M., Regev A., Rinn J.L., Lander E.S., 2009. Chromatin signature reveals over a thousand highly conserved large non-coding RNAs in mammals. Nature 458, 223-227.


Further Reading
ncRNA - http://nar.oxfordjournals.org/cgi/reprint/35/suppl_1/D178
snoRNA - http://en.wikipedia.org/wiki/SnoRNA
siRNA - http://en.wikipedia.org/wiki/SiRNA
miRNA - http://en.wikipedia.org/wiki/MicroRNA
piRNA - http://en.wikipedia.org/wiki/Piwi-interacting_RNA
rasiRNA - http://en.wikipedia.org/wiki/RasiRNA
stRNA - http://jcs.biologists.org/cgi/content/full/116/23/4689
Ribozymes - http://en.wikipedia.org/wiki/Ribozyme


Databases
miRBase - http://microrna.sanger.ac.uk/sequences/
RNAdb - http://research.imb.uq.edu.au/rnadb/default.aspx


Wednesday, March 4, 2009

Bloginar: The Next Generation Dilemma: Large Scale Data Analysis

Previous posts shared some of the things we learned at the AGBT and ABRF meetings in early February. Now it is time to share the work we presented, starting with the AGBT poster, “The Next Generation Dilemma: Large Scale Data Analysis.”

The goal of the poster was to provide a general introduction to the power of Next Generation Sequencing (NGS) and a framework for data analysis. Hence, the abstract described the general NGS data analysis process, its issues, and what we are doing for one kind of transcription profiling, RNA-Seq. Between then and now we learned a few things... And the project grew.

The map below guides my “bloginar” poster presentation. In keeping with the general theme of the abstract we focused on transcription analysis, but instead of focusing exclusively on RNA-Seq, the project expanded to compare three kinds of transcription profiling: RNA-Seq, Tag Profiling, and Small RNA Analysis. A link to the poster is provided at the end.

Section 1 provides a general introduction to NGS by discussing the ways NGS is being used to study different aspects of molecular biology. It also covers how the data are analyzed in three phases (primary, secondary, tertiary) to convert raw data into biologically meaningful information. The three-phase model has emerged as a common framework to describe the process of converting image data into primary sequence data (reads) and then turning the reads into information that can be used in comparative analyses. Secondary analysis is the phase where reads are aligned to reference sequences to get gene names, position, and (or) frequency information that can be used to measure changes, like gene expression, between samples.
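As a toy illustration of what secondary analysis produces, the Python sketch below turns a handful of reads into per-gene hit counts. Everything here is an assumption made for the example: the reference sequences and reads are invented, `count_hits` is a hypothetical function, and exact substring matching stands in for a real aligner, which would tolerate mismatches and index the reference.

```python
from collections import Counter

# Hypothetical reference transcripts (real references would come from RefSeq or similar).
references = {
    "geneA": "ATGGCGTACGTTAGCCTAGGCTA",
    "geneB": "ATGTTTCCCGGGAAATTTCCCGG",
}

# Hypothetical reads, i.e. the output of primary analysis.
reads = ["GCGTACG", "TAGCCTA", "TTTCCCG", "GGGGGGG", "CGTTAGC"]

def count_hits(reads, references):
    """Assign each read to the first reference that contains it and tally the hits."""
    counts = Counter()
    for read in reads:
        for name, seq in references.items():
            if read in seq:
                counts[name] += 1
                break
        else:
            # No reference contained the read.
            counts["unaligned"] += 1
    return counts

counts = count_hits(reads, references)
for name, n in sorted(counts.items()):
    print(name, n)
```

The resulting table of names and frequencies is the kind of input that tertiary analysis compares across samples.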

The remaining sections of the poster use examples from transcription analysis to illustrate and address the multiple challenges (listed below) that must be overcome to efficiently use NGS.
  • High end infrastructures are needed to manage and work with extremely large data sets
  • Complex, multistep analysis procedures are required to produce meaningful information
  • Multiple reference data are needed to annotate and verify data and sample quality
  • Datasets must be visualized in multiple ways
  • Numerous Internet resources must be used to fill in additional details
  • Multiple datasets must be comparatively analyzed to gain knowledge
Section 2 describes the three different kinds of transcription profiling experiments. This section provides additional background on the methods and what they measure. For example, RNA-Seq and Tag Profiling are commonly used to measure gene expression. In RNA-Seq, DNA libraries are prepared by randomly amplifying short regions of DNA from cDNA. The sequences that are produced will generally cover the entire region of the transcripts that were originally isolated. Hence, it is possible to get information about alternative splicing and biased allelic expression. In contrast, Tag Profiling focuses on creating DNA libraries from discrete points within the RNA molecules. With Tag Profiling, one can quickly measure relative gene expression, but cannot get information about alternative splicing and allelic expression. The table in section 2 discusses these and other issues one must consider when running the different assays.

Sections 3, 4, and 5 outline three transcriptome scenarios (RNA-Seq, Tag Profiling, and Small RNA, respectively) using real data examples (references provided in the poster). Each scenario follows a common workflow involving the preparation of DNA libraries from RNA samples, followed by secondary analysis, followed by tertiary analysis of the data in GeneSifter Analysis Edition.

For RNA-Seq, two datasets corresponding to mouse embryonic stem (ES) cells and embryoid bodies (EB) were investigated. DNA libraries were produced from each cell line. Sequences were collected from the library and compared to the RefSeq (NCBI) database according to the pipeline shown. The screen captures (middle of the panel) show how the individual reads map to each transcript along with the total numbers of hits summarized by chromosome. The process is run twice, once for each cell line, and the two sets of alignments are converted to Gene Lists for comparative analysis in GeneSifter Analysis Edition to observe differential expression (bottom of the panel).
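The comparison of two Gene Lists can be sketched roughly as follows. This is a simplified stand-in, not GeneSifter's method: the gene names and counts are invented, the function names are hypothetical, and a real analysis would add replicates and a statistical test. Counts are scaled to reads per million so that libraries of different depth are comparable, and a log2 ratio summarizes the change for each gene.

```python
import math

def reads_per_million(counts):
    """Scale raw hit counts so each library sums to one million."""
    total = sum(counts.values())
    return {gene: 1e6 * n / total for gene, n in counts.items()}

def log2_ratios(a, b, pseudocount=1.0):
    """Return log2(b / a) per gene; the pseudocount avoids division by zero."""
    genes = sorted(set(a) | set(b))
    return {
        g: math.log2((b.get(g, 0.0) + pseudocount) / (a.get(g, 0.0) + pseudocount))
        for g in genes
    }

# Hypothetical hit counts for the two samples.
es_counts = {"geneA": 300, "geneB": 100, "geneC": 600}
eb_counts = {"geneA": 50, "geneB": 50, "geneC": 900}

ratios = log2_ratios(reads_per_million(es_counts), reads_per_million(eb_counts))
for gene, value in ratios.items():
    direction = "up in EB" if value > 0 else "down in EB"
    print(f"{gene}: log2 ratio {value:+.2f} ({direction})")
```

Genes with large positive or negative ratios are the candidates one would inspect further in the differential expression view.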

The Tag Profiling panel examines data from a recently published experiment (a reference is provided in the poster) in which gene expression was studied in transgenic mice. I’ll leave out the details of the paper, and only point out how this example shows the differences between Tag Profiling and RNA-Seq data. Because Tag Profiling collects data from specific 3’ sites in RNA, the aligned data (middle of the panel) show alignments as single “spikes” toward the 3’ end of transcripts. Occasionally multiple peaks are observed. The question is whether the additional peaks result from isoforms (alternative polyA sites) or from incomplete restriction enzyme digests. How might this be sorted out? As with RNA-Seq, the bottom of the panel shows the comparative analysis of replicate samples from the wild type (WT) and transgenic (TG) mice.

Data from a small RNA analysis experiment are analyzed in the third panel. Unlike RNA-Seq and Tag Profiling, this secondary analysis compares the reads to several different sets of reference sequences. The purpose is to identify and filter out common artifacts observed in small RNA preparations. The pipeline we used, and data produced, are shown in the middle of the panel. Histogram plots of read length distribution, determined from alignments in different reference sources, are created because an important feature of small RNAs is that they are small. Distributions clustered around 22 nt indicate a good library. Finally, data are linked to additional reports and databases, like miRBase (Sanger Center), to explore results further. In the example shown, the first hit was to a small RNA that has been observed in opossums; now we have a human counterpart. In total, four samples were studied. As with RNA-Seq and Tag Profiling, we can observe the relative expression of each small RNA by analyzing the datasets together (hierarchical clustering, bottom).
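The read length check is easy to sketch in Python. The reads below are synthetic placeholders generated only to have particular lengths, and the 20 to 24 nt window used to summarize the library is an arbitrary choice for this example, not a threshold from the poster.

```python
from collections import Counter

def length_histogram(reads):
    """Count how many reads there are of each length."""
    return Counter(len(r) for r in reads)

# Synthetic reads with chosen lengths (placeholders, not real small RNA sequences).
lengths = [21, 22, 22, 22, 23, 22, 30, 18]
reads = [("ACGU" * 8)[:n] for n in lengths]

hist = length_histogram(reads)
mode_length = hist.most_common(1)[0][0]
in_window = sum(n for length, n in hist.items() if 20 <= length <= 24)
fraction = in_window / len(reads)

# Text histogram: one '#' per read.
for length in sorted(hist):
    print(f"{length:>3} nt {'#' * hist[length]}")
print(f"most common length: {mode_length} nt; {fraction:.0%} of reads are 20-24 nt")
```

A library whose histogram peaks far from 22 nt, or is flat, would be a signal to look at degradation products or adapter artifacts before going further.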

Section 6 presents some of the challenges of scale that accompany NGS and how we are addressing them with HDF5 technology. This will be a topic of many more posts in the future.

We close the poster by addressing the challenges listed above with the final points:
  • High performance data management systems are being developed through the BioHDF project and GeneSifter system architectures.
  • The examples show how each application and sequencing platform requires a different data analysis workflow (pipeline). GeneSifter provides a platform to develop and make bioinformatics pipelines and data readily available to communities of biologists.
  • The transcriptome is complex; different libraries of sequence data can be used to filter known sequences (e.g. rRNA) and discover new elements (miRNAs) and isoforms of expressed genes.
  • Within a dataset, read maps, tables, and histogram plots are needed to summarize and understand the kinds of sequences present and how they relate to an experiment.
  • Links to Entrez Gene, the UCSC Genome Browser, and miRBase show how additional information can be integrated into the application framework and used.
  • Next Gen transcriptomics assays are similar to microarray assays in many ways, hence software systems like Geospiza’s GeneSifter are useful for comparative analysis.
You can also get the file, AGBT_2009.pdf


Sunday, March 1, 2009

Sneak Peek: Small RNA Analysis with Geospiza

Join us this Wednesday, March 4th at 10:00 A.M. PST (1:00 P.M. EST), for a webinar focusing on small RNA analysis. Eric Olson, our VP of Product Development and principal designer of Geospiza’s GeneSifter Analysis Edition, will present our latest insights on analyzing large Next Generation Sequencing datasets to study small RNA biology.

Follow the link to register for this interesting presentation.

Abstract

Next Generation Sequencing allows whole genome analysis of small RNAs at an unprecedented level. Current technologies allow for the generation of 200 million data points in a single instrument run. In addition to allowing for the complete characterization of all known small RNAs in a sample, these applications are also ideal for the identification of novel small RNAs. This presentation will provide an overview of micro RNA expression analysis from raw data to biological significance using examples from publicly available datasets and Geospiza’s GeneSifter software.


