Friday, April 25, 2008

Managing Digital Gene Expression Workflows with FinchLab

Last Wednesday (4/23), Illumina hosted a Geospiza presentation on how FinchLab supports mRNA tag profiling experiments. We had a great turnout, and the presentation is posted on the Illumina web site.

In the webinar we talked about:
  • Next Gen sequencing applications
  • How the Illumina Genome Analyzer makes mRNA Tag Profiling more sensitive, illustrated with features of mRNA Tag Profiling data sets viewed in FinchLab
  • Setting up and tracking laboratory workflows with FinchLab
  • Why it is important to link the laboratory work and data analysis work
  • Setting up data analysis and reviewing results with FinchLab
  • Using hosted solutions to overcome the significant data management challenges that accompany Next Gen technologies
Over the coming weeks and months we'll explore the above points through multiple posts. In the meantime, get the presentation and enjoy.

From Sample to Results: Managing Illumina Data Workflow with FinchLab

Monday, April 21, 2008

Sneak Peek: Managing Next Gen Digital Gene Expression Workflows

This Wednesday, April 23rd, Illumina will host a webinar featuring the Geospiza FinchLab.

If you are interested in:
  • Learning about Next Gen sequencing applications
  • Seeing how the Illumina Genome Analyzer makes mRNA Tag Profiling more sensitive
  • Understanding the flow of data and information as samples are converted into results
  • Overcoming the significant data management challenges that accompany Next Gen technologies
  • Setting up Next Gen sequencing in your core lab
  • Creating a new lab with Next Gen technologies
This webinar is for you!

In the webinar, we will talk about the general applications of Next Gen sequencing and focus on using the Illumina Genome Analyzer to perform Digital Gene Expression experiments by highlighting mRNA Tag Profiling. Throughout the talk we will give specific examples about collecting and analyzing tag profiling data and show how the Geospiza FinchLab solves challenges related to laboratory setup and managing Next Gen data and analysis workflows.

Monday, April 14, 2008

Digital Gene Expression with Next Gen Sequencing

Next Gen Sequencing is changing how we approach problems ranging from whole genome shotgun sequencing, to variation analysis, to gene expression, to structural genomics. Next week, April 23rd, Geospiza will present a webinar on managing Digital Gene Expression experiments and data with FinchLab. The webinar is hosted by Illumina as part of their ongoing webinar series on Next Gen sequencing.

Abstract

Next Gen sequencers enable researchers to perform new and exciting experiments like digital gene expression. Next Gen sequencers, however, also expose researchers to unprecedented volumes of experimental data and create the need for new tools to support these projects. A single run of the Illumina Genome Analyzer, for example, can generate terabytes of data and hundreds of thousands of files. To manage these projects effectively, researchers will need new software systems that let them quickly track samples, access and analyze the key results files produced by these runs, and focus on the science rather than IT.

In this webinar, Geospiza will demonstrate how the FinchLab Next Gen Edition workflow software can be used to track samples, review data quality, and characterize the biological significance of an Illumina dataset while streamlining the entire process from sample to result for a Digital Gene Expression experiment.

Hope to see you there.

Wednesday, April 2, 2008

Working with Workflows

Genetic analysis workflows involve complex procedures, both in the laboratory and in data analysis and manipulation. A good workflow management system not only tracks processes, but simplifies the work.

In my last post, I introduced the concept of workflows while describing the issues you need to think about as you prepare your lab for Next Gen sequencing. To better understand these challenges, we can learn from previous experience with Sanger sequencing in particular and genetic assays in general.

As we know, DNA sequencing serves many purposes. New genomes and genes in the environment are characterized and identified by de novo sequencing. Gene expression can be assessed by measuring Expressed Sequence Tags (ESTs), and DNA variation and structure can be investigated by resequencing regions of known genomes. We also know that gene expression and genetic variation can be studied with multiple technologies such as hybridization, fragment analysis, and direct genotyping, and that it is desirable to use multiple methods to confirm results. Within each of these general applications and technology platforms, specific laboratory and bioinformatics workflows are used to prepare samples, determine data quality, study biology, and predict biological outcomes.

The process begins in the laboratory.

Recently I came across a Wikipedia article on DNA sequencing that had a simple diagram showing the flow of materials from samples to data. I liked this diagram, so I reproduced it, with modifications. We begin with the sample. A sample is a general term that describes a biological material. Sometimes, like when you are at the doctor, these are called specimens. Since biology is all around and in us, samples come from anything that we can extract DNA or RNA from. Blood, organ tissue, hair, leaves, bananas, oysters, cultured cells, feces, you-can-imagine-what-else, can all be samples for genetic analysis. I know a guy who uses a 22 to collect the apical meristems from trees to study poplar genetics. Samples come from anywhere.

With our samples in hand, we can perform genetic analyses. What we do next depends on what we want to learn. If we want to sequence a genome we're going to prepare a DNA library by randomly shearing the genomic DNA and cloning the fragments into sequencing vectors. The purified cloned DNA templates are sequenced and the data we obtain are assembled into larger sequences (contigs) until, hopefully, we have a complete genome. In resequencing and other genetic assays, DNA templates are prepared from sample DNA by amplifying specific regions of a genome with PCR. The PCR products, amplicons, are sequenced and the resulting data are compared to a reference sequence to identify differences. Gene expression (EST and hybridization) analysis follows similar patterns except that RNA is purified from samples and then converted to cDNA using RT-PCR (Reverse Transcriptase PCR, not Real Time PCR - that's a genetic assay).

From a workflow point of view, we can see how the physical materials change throughout the process. Sample material is converted to DNA or RNA (nucleic acids), and the nucleic acids are further manipulated to create templates that are used for the analytical reaction (DNA sequencing, fragment analysis, Real Time PCR, ...). As the materials flow through the lab, they're manipulated in a variety of containers. A process may begin with a sample in a tube, use a petri plate to isolate bacterial colonies, 96-well plates to purify DNA and perform reactions, and 384-well plates to collect sequence data. The movement of the materials must be tracked, along with their hierarchical relationships. A sample may have many templates that are analyzed, and a template may have multiple analyses. When we do this a lot, we need a way to see where our samples are in their particular processes. We need a workflow management system, like FinchLab.
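To make the hierarchy concrete, here is a minimal sketch of the sample, template, and analysis relationships described above. The class names, fields, and example values are all assumptions made for illustration; this is not FinchLab's data model.

```python
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass
class Location:
    """Where a material currently sits (hypothetical, free-text fields)."""
    container: str  # e.g. "tube", "96-well plate", "384-well plate"
    position: str   # e.g. "A1" or "rack 1"


@dataclass
class Analysis:
    kind: str            # e.g. "DNA sequencing", "fragment analysis"
    location: Location
    done: bool = False


@dataclass
class Template:
    name: str
    location: Location
    analyses: List[Analysis] = field(default_factory=list)


@dataclass
class Sample:
    name: str
    location: Location
    templates: List[Template] = field(default_factory=list)

    def pending(self) -> List[Tuple[str, str]]:
        """Return (template name, analysis kind) pairs not yet finished."""
        return [
            (t.name, a.kind)
            for t in self.templates
            for a in t.analyses
            if not a.done
        ]


# One sample, one template, two analyses (one finished, one still pending).
sample = Sample("leaf-01", Location("tube", "rack 1"))
tmpl = Template("leaf-01-amplicon-1", Location("96-well plate", "A1"))
tmpl.analyses.append(
    Analysis("DNA sequencing", Location("384-well plate", "A1"), done=True)
)
tmpl.analyses.append(Analysis("fragment analysis", Location("96-well plate", "B1")))
sample.templates.append(tmpl)

print(sample.pending())  # [('leaf-01-amplicon-1', 'fragment analysis')]
```

Even in a toy like this, answering "where is my sample in its process?" means walking the whole hierarchy, which is the bookkeeping a workflow management system takes off your hands.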

Thursday, March 13, 2008

What's a Bustard?

For that matter, what's a Firecrest? Or a Gerald? Many people with an Illumina Genome Analyzer are now learning that these are the directories that hold the data they may be interested in.

What's in those directories?

In this post, we explore some of the data in the directories, talk about what data might be important, and use FinchLab Next Gen Edition (FinchLab NG) to look at some of the files. In the Next Gen world we are also going to be learning about the data life cycle. When you are thinking about how to store three or four or ten terabytes (TB) of data for each run, and considering that you might run your instrument 40 or 50 times or more in the next year, you might stop and ask the question, "how much of that data is really important and for how long?" That's the data life cycle. It's going to be important.
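To put rough numbers on that question, here is a back-of-the-envelope sketch using only the ranges quoted above (3 to 10 TB per run, 40 to 50 runs per year). The function name is made up for illustration, and the result assumes you keep every file from every run.

```python
# Rough storage estimate using the ranges quoted in the post.
# The function name is illustrative only.

def yearly_storage_tb(tb_per_run, runs_per_year):
    """Return total terabytes per year if every file from every run is kept."""
    return tb_per_run * runs_per_year


low = yearly_storage_tb(3, 40)    # smallest run size, fewest runs
high = yearly_storage_tb(10, 50)  # largest run size, most runs

print(f"Keeping everything: {low:.0f} to {high:.0f} TB per year")
# Keeping everything: 120 to 500 TB per year
```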

To begin our understanding, let's look at the data being created in a run. When an Illumina Genome Analyzer (aka Solexa) collects data, many things happen. First, images are collected for every tile in every lane of the slide at each cycle of the run. They're pretty small, but there are a lot, maybe 360,000 or so, and they add up to the terabytes we talk about. These images are analyzed to create tens of thousands (about 100 gigabytes [GB] worth) of "raw intensity files" that go in the Firecrest directory. Next, a base-calling algorithm reads the raw intensity files to create sequence, quality and other files (about 80 GB worth) that go in the Bustard directory. The last step is the Eland program pipeline. It reads the Bustard files, aligns their data to reference sequences, makes more quality calculations, and creates more files. These data go in the Gerald directory to give about 25 or 30 GB of sequence and quality data.

So, what's the best data to work with? That depends on what problem you are trying to solve. Specialists developing new basecalling tools or alignment tools might focus on the data in Firecrest and Bustard. Most researchers, however, are going to work with data in the Gerald directory. That reduces our TB problem down to a tens of GB problem. That's a big difference!
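If you want to check these figures against your own run, a small script can total the files under each directory. This is a minimal sketch that uses only the Python standard library. The function names are made up for illustration, and you supply the directory paths yourself, because the exact layout of a run folder is not described here.

```python
import os


def directory_size_bytes(path):
    """Sum the sizes of all regular files under path, recursively."""
    total = 0
    for root, _dirs, files in os.walk(path):
        for name in files:
            full = os.path.join(root, name)
            # Skip symbolic links so linked files are not counted twice.
            if os.path.isfile(full) and not os.path.islink(full):
                total += os.path.getsize(full)
    return total


def size_report(paths):
    """Map each given directory path to its total size in gigabytes."""
    return {path: directory_size_bytes(path) / 1e9 for path in paths}
```

Call `size_report([...])` with the paths to your own Firecrest, Bustard, and Gerald directories. If one of those directories is nested inside another, the parent's total includes the child's files.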

FinchLab NG can help.

FinchLab NG gives you the LIMS capabilities to run your Next Gen laboratory workflows and track which samples go on which slides and where on the slide the samples go. We call this part the run. When a run is complete, you can link the data on your filesystem to FinchLab NG and use the web interfaces to explore the data. You can also link specific data files to samples. So, if you are sharing data or operating a core lab, your researchers can easily access their data through their Finch account.

The screen shot below gives an example of how the HTML quality files can be explored. It shows two windows. The one on the left is FinchLab NG with data for a Solexa run. You can see that the directory has 3606 files and that a number of them are htm summary files. You can find these 12 files in that directory of 3606 files by entering "htm" in the Finch Finder. The window on the right was obtained by clicking the "Intensity Plots" link that is directly below the info table and just above the data list. In this example the intensity plots are shown for each tile of the 8 lanes on the slide. To see this better, click the image and zoom in with your browser.
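Outside of FinchLab, the same kind of search can be imitated on a plain filesystem. The sketch below assumes a simple case-insensitive substring match on file names. That is a guess at the behavior for illustration, not a description of how the Finch Finder works internally, and the function name is made up.

```python
import os


def find_files(directory, term):
    """Return sorted names of files in directory whose names contain term.

    The match is a case-insensitive substring test; subdirectories are ignored.
    """
    term = term.lower()
    return sorted(
        name
        for name in os.listdir(directory)
        if term in name.lower()
        and os.path.isfile(os.path.join(directory, name))
    )
```

Calling `find_files(run_directory, "htm")` on a directory like the one in the screen shot would pick out the handful of htm summary files from the thousands of other files.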
