Monday, July 14, 2008

Maq Attack

Maq (Mapping and Assembly with Quality) is an algorithm, developed at the Sanger center, for assembling Next Gen reads onto a reference sequence. Since Maq is widely used for working with Next Generation DNA sequence data, we chose to include support for Maq in our upcoming release of FinchLab. In this post, we will discuss integrating secondary analysis algorithms like Maq with the primary analysis and workflows in FinchLab.

Improving laboratory processes through immediate feedback

The cost to run Next Generation DNA sequencing instruments and the volume of data produced make it important for labs to be able to monitor their processes in real time. In the last post, I discussed how labs can get performance data and accomplish scientific goals during the three stages of data analysis. To quickly review: Primary data analysis involves converting image data to sequence data. Secondary data analysis involves aligning the sequences from the primary data analysis to reference data to create data sets that are used to develop scientific information. An example of a secondary analysis step would be assembling reads into contigs when new genomes are sequenced. Unlike the first two stages, where much of the data is used to detect errors and measure laboratory performance, the last stage is focused on the science. In tertiary data analysis, genomes are annotated and data sets are compared. Thus the tertiary analyses are often the most important in terms of gaining new insights. The data used in this phase must be vetted first: they must be high quality and free from systemic errors.

The companies producing Next Gen systems recognize the need to automate primary and secondary analysis. Consequently, they provide some basic algorithms along with the Next Gen instruments. Although these tools can help a lab get started, many labs have found that significant software development is needed on top of the starting tools if they are to fully automate their operation, translate output files into meaningful summaries, and give users easy access to the data. The starter kits from the instrument vendors can also be difficult to adapt when performing other kinds of experiments. Working with Next Gen systems typically means that you will have to deal with a lot of disconnected software, a lack of user interfaces, and diverse new choices for algorithms when it comes to getting your work done.

FinchLab and Maq in an integrated system

The Geospiza FinchLab integrates analytical algorithms such as Maq into a complete system that encompasses all the steps in genetic analysis. Our Samples to Results platform provides flexible data entry interfaces to track sample metadata. The laboratory information management system is user configurable so that any kind of genetic analysis procedure can be run and tracked and, most importantly, it provides tight linkage between samples, lab work, and their resulting data. This system makes it easy to transition high quality primary results to secondary data analysis.

One of the challenges with Next Gen sequencing has been choosing an algorithm for secondary analysis. Secondary data analysis needs to be adaptable to different technology platforms and algorithms for specialized sequencing applications. FinchLab meets this need because it can accommodate multiple algorithms when it comes to secondary and tertiary analysis. One of these algorithms is Maq. Maq is attractive because it can be used in diverse applications where reads are aligned to a reference sequence. Among these are Transcriptomics (Tag Profiling, EST analysis, small RNA discovery), Promoter Mapping (ChIP-Seq, DNase hypersensitivity), Methylation analysis, and Variation Analyses (SNP, CNV). Maq offers a rich set of output files so it can be used to quickly provide an overview of your data and help you verify that your experiment is on track before you invest serious time in tertiary work. Finally, Maq is being actively developed and improved and is open-source, so it is easy to access and use regardless of affiliation.

Maq and other algorithms are integrated into FinchLab through the FinchLab Remote Analysis Server (RAS). RAS is a lightweight job tracking system that can be configured to run any kind of program in different computing environments. RAS communicates with FinchLab to get the data and return the results. Data analyses are run in FinchLab by selecting the sequence file(s), clicking a link to go to a page where you select the analysis method(s) and reference data sets, and then clicking a button to start the work. RAS tracks the details of data processing and sends information back to FinchLab so that you can always see what is happening through the web interface.
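To make the idea of a lightweight job tracker concrete, here is a minimal Python sketch. It is not how RAS is implemented. The `Job` class, its state names, and the `run_job` helper are all hypothetical, chosen only to illustrate running an arbitrary command, recording its state, and keeping a log that a web interface could display.

```python
import subprocess
import sys
from dataclasses import dataclass, field
from datetime import datetime, timezone


@dataclass
class Job:
    """Hypothetical job record; field names are illustrative, not from RAS."""
    name: str
    command: list
    state: str = "queued"          # queued -> running -> done | failed
    log: list = field(default_factory=list)

    def note(self, message: str) -> None:
        # Timestamp every event so problems can be traced later.
        stamp = datetime.now(timezone.utc).isoformat(timespec="seconds")
        self.log.append(f"{stamp} {self.name}: {message}")


def run_job(job: Job) -> Job:
    """Run the job's command and record what happened."""
    job.state = "running"
    job.note("started " + " ".join(job.command))
    result = subprocess.run(job.command, capture_output=True, text=True)
    if result.returncode == 0:
        job.state = "done"
        job.note("finished")
    else:
        job.state = "failed"
        job.note(f"exit code {result.returncode}: {result.stderr.strip()}")
    return job


# Stand-in commands; a real pipeline would call an aligner here.
ok = run_job(Job("demo", [sys.executable, "-c", "print('aligned')"]))
bad = run_job(Job("broken", [sys.executable, "-c", "raise SystemExit(3)"]))
```

The point of the sketch is that the tracker never needs to know what the program does; it only records the command, the outcome, and a time-stamped trail.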

A basic FinchLab system includes the RAS and pipelines for running Maq in two ways. The first is Tag Profiling and Expression Analysis. In this operation, Maq output files are converted to gene lists with links to drill down into the data and NCBI references. The second option is to use Maq in a general analysis procedure where all the output files are made available. In the coming months, new tools will convert more of these files into output that can be added to genome browsers and other tertiary analysis systems.
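At its core, turning alignments into a gene list means counting how many reads align to each reference sequence. The sketch below shows that counting step only. It assumes tab-delimited alignment text in which column 2 holds the reference name and column 7 the mapping quality, which roughly follows the text that `maq mapview` prints; treat the column positions, the `count_tags` name, and the quality cutoff as assumptions to check against your own files.

```python
from collections import Counter


def count_tags(lines, min_quality=10):
    """Count aligned reads per reference, skipping low-quality mappings."""
    counts = Counter()
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 7:
            continue  # skip blank or truncated lines
        reference, quality = fields[1], int(fields[6])
        if quality >= min_quality:
            counts[reference] += 1
    return counts


# Three made-up alignment records (read, reference, position, strand,
# insert size, pair flag, mapping quality).
example = [
    "read1\tgeneA\t101\t+\t0\t0\t60",
    "read2\tgeneA\t150\t-\t0\t0\t5",
    "read3\tgeneB\t12\t+\t0\t0\t42",
]
gene_list = count_tags(example).most_common()
```

Here `read2` is dropped because its mapping quality falls below the cutoff, so each gene ends up with one counted tag.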

A final strength of RAS is that it produces different kinds of log files to track potential errors. These kinds of files are extremely valuable in trouble-shooting and fixing problems. Since Next Gen technology is new and still in constant flux, you can be certain that unexpected issues will arise. Keeping the research on track is easier when informative RAS logging and reports help to diagnose and resolve issues quickly. Not only can FinchLab help with Next Gen assays and help solve those unexpected Next Gen problems, but multiple Next Gen algorithms can also be integrated into FinchLab to complete the story.


Wednesday, June 25, 2008

Finch 3: Getting Information Out of Your Data

Geospiza's tag line "From Sample to Results" represents the importance of capturing information from all steps in the laboratory process. Data volumes are important and lots of time is being spent discussing the overwhelming volumes of data produced by new data collection technologies like Next Gen sequencers. However, the real issue is not how you are going to store the data, rather it is what are you going to do with it? What do your data mean in the context of your experiment?

The Geospiza FinchLab software system supports the entire laboratory and data analysis workflow to convert sample information into results. What this means is that the system provides a complete set of web-based interfaces and an underlying database to enter information about samples and experiments, track sample preparation steps in the laboratory, link the resulting data back to samples, and process the data to get biological information. Previous posts have focused on information entry, laboratory workflows, and data linking. This post will focus on how data are processed to get biological information.

The ultra-high data output of Next Gen sequencers allows us to use DNA sequencing to ask many new kinds of questions about structural and nucleotide variation and measure several indicators of expression and transcription control on a genome-wide scale. The data produced consist of images, signal intensity data, quality information, and DNA sequences with their quality values. For each data collection run, the total collection of data and files can be enormous and can require significant computing resources. While all of the data have to be dealt with in some fashion, some of the data have long-term value while other data are only needed in the short term. The final scientific results will often be produced by comparing data sets created from the DNA sequences and their comparison to reference data.

Next Gen data are processed in three phases.

Next Gen data workflows involve three distinct phases of work:

1. Data are collected from control and experimental samples.
2. Sequence data obtained from each sample are aligned to reference sequence data, or data sets, to produce aligned data sets.
3. Summaries of the alignment information from the aligned data sets are compared to produce scientific understanding.

Each phase has a discrete analytical process and we, and others, call these phases primary data analysis, secondary data analysis, and tertiary data analysis.

Primary data analysis involves converting image data to sequence data. The sequence data can be in familiar "ACTG" sequence space or less familiar color space (SOLiD) or flow space (454). Primary data analysis is commonly performed by software provided by the data collection instrument vendor and it is the first place where quality assessment about a sequencing run takes place.

Secondary data analysis creates the data sets that will be further used to develop scientific information. This step involves aligning the sequences from the primary data analyses to reference data. Reference data can be complete genomes, subsets of genomic data like expressed genes, or individual chromosomes. Reference data are chosen in an application specific manner and sometimes multiple reference data sets will be used in an iterative fashion.

Secondary data analysis has two objectives. The first is to determine the quality of the DNA library that was sequenced, from a biological and sample perspective. The primary data analysis supplies quality measurements that can be used to determine if the instrument ran properly, or whether the density of beads or clusters was at its optimum to deliver the highest number of high quality reads. However, those data do not tell you about the quality of the samples. Sample quality raises different questions. Did the DNA library contain systematic artifacts such as sequence bias? Were there high numbers of ligated adaptors or incomplete restriction enzyme digests, or any other factors that would interfere with interpreting the data? These kinds of questions are addressed in the secondary data analysis by aligning your reads to the reference data and seeing that your data make sense.
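One of the sample-quality questions above, how many reads carry ligated adaptor, can be roughly estimated with a few lines of Python. This is a sketch only: the adaptor sequence below is invented for illustration, `adaptor_fraction` is a hypothetical helper, and a real check would use your kit's adaptor sequence and tolerate mismatches.

```python
def adaptor_fraction(reads, adaptor, min_overlap=12):
    """Return the fraction of reads containing the start of the adaptor."""
    probe = adaptor[:min_overlap]
    if not reads:
        return 0.0
    hits = sum(1 for read in reads if probe in read)
    return hits / len(reads)


ADAPTOR = "ACGTTGCAGGCTAACCGT"  # made-up sequence, not a real adaptor
reads = [
    "TTGACCGATAGGCTTAACGGATCCA",
    "GGAACGTTGCAGGCTAACCGTTTAA",   # contains the adaptor
    "CCATAGGATTACAGGCATTAGCCAT",
    "ACGTTGCAGGCTAACCGTACGTTGC",   # adaptor at the start
]
fraction = adaptor_fraction(reads, ADAPTOR)
```

A high fraction on a small sample of reads is a cheap early warning that the library, rather than the instrument, is the problem.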

The second objective of secondary data analysis is to prepare the data sets for tertiary analysis where they will be compared in an experimental fashion. This step involves further manipulation of alignments, typically expressed in very large, hard-to-read, algorithm-specific tables, to produce data tables that can be consumed by additional software. Speaking of algorithms, there is a large and growing list to choose from. Some are general purpose and others are specific to particular applications; we'll comment more on that later.

Tertiary data analysis represents the third phase of the Next Gen workflow. This phase may involve a simple activity like viewing a data set in a tool like a genome browser so that the frequency of tags can be used to identify promoter sites, patterns of variation, or structural differences. In other experiments, like digital gene expression, tertiary analysis can involve comparing different data sets in a similar fashion to microarray experiments. These kinds of analyses are the most complex; expression measurements need to be normalized between data sets and statistical comparisons need to be made to assess differences.
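The normalization step can be as simple as scaling raw counts to a common library size. Below is a sketch using tags per million, which is one common convention and not necessarily what any particular tool does. The gene names and counts are invented, and the statistical testing that a real comparison needs is left out.

```python
import math


def tags_per_million(counts):
    """Scale raw tag counts so that each data set sums to one million."""
    total = sum(counts.values())
    return {gene: 1_000_000 * n / total for gene, n in counts.items()}


def log2_ratio(control, treated, gene, pseudocount=1.0):
    """Log2 fold change for one gene; the pseudocount avoids log(0)."""
    return math.log2((treated.get(gene, 0.0) + pseudocount) /
                     (control.get(gene, 0.0) + pseudocount))


# Invented counts: the treated library was sequenced twice as deeply.
control = tags_per_million({"geneA": 100, "geneB": 300, "geneC": 600})
treated = tags_per_million({"geneA": 200, "geneB": 600, "geneC": 1200})

change = log2_ratio(control, treated, "geneA")
```

Without normalization, every gene in this example would appear to double; after scaling, the apparent change disappears because it came only from sequencing depth.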

To summarize, the goal of primary and secondary analysis is to produce well-characterized data sets that can be further compared to obtain scientific results. Well-characterized means that the quality is good for both the run and the samples and that any biologically relevant artifacts are identified, limited, and understood. The workflows for these analyses involve many steps, multiple scientific algorithms, and numerous file formats. The choices of algorithms, data files, data file formats, and overall number of steps depend on the kinds of experiments and assays being performed. Despite this complexity, there are standard ways to work with Next Gen systems to understand what you have before progressing through each phase.

The Geospiza FinchLab system focuses on helping you with both primary and secondary data analysis.


Friday, June 13, 2008

Finch 3, Linking Samples and Data

One of the big challenges with Next Gen sequencing is linking sample information with data. People tell us: "It's a real problem." "We use Excel, but it is hard." "We're losing track."

Do you find it hard to connect sample information with all the different types of data files? If so you should look at FinchLab.

A review:

About a month ago, I started talking about our third version of the Finch platform and introduced the software requirements for running a modern lab. To review, labs today need software systems that allow them to:

1. Set up different interfaces to collect experimental information
2. Assign specific workflows to experiments
3. Track the workflow steps in the laboratory
4. Prepare samples for data collection runs
5. Link data from the runs back to the original samples
6. Process data according to the needs of the experiment

In FinchLab, order forms are used to first enter sample information into the system. They can be created for specific experiments and the samples entered will, most importantly, be linked to the data that are produced. The process is straightforward. Someone working with the lab, a customer or collaborator, selects the appropriate form and fills out the requested information. Later, an individual in the lab reviews the order and, if everything is okay, chooses the "processing" state from a menu. This action "moves" the samples into the lab where the work will be done. When the samples are ready for data collection they are added to an "Instrument run." The instrument run is Finch's way of tracking which samples go in what well of a plate or lane/chamber on a slide. The samples are added to the instrument and data are collected.

The data

Now comes the fun part. If you have a Next Gen system you'll ultimately end up with 1000's of files scattered in multiple directories. The primary organization for the data will be in unix-style directories, which are like Mac or Windows folders. Within the directories you will find a mix of sequence files, quality files, files that contain information about run metrics and possibly images. You'll have to make decisions about what to save for long-term use and what to archive, or delete.

As noted, the instrument software organizes the data by the instrument run. However, a run can have multiple samples, and the samples can be from different experiments. A single sample can be spread over multiple lanes and chambers of a slide. If you are running a core lab, the samples will come from different customers and your customers often belong to different lab groups. And there is the analysis. The programs that operate on the data require specific formats for input files and produce many kinds of output files. Your challenge is to organize the data so that it is easy to find and access in a logical way. So what do you do?

Organizing data the hard way

If you do not have a data management system, you'll need to write down which samples go with which person, group or experiment. That's pretty simple. You can tape a piece of paper on the instrument and write this down, or you can diligently open a file, commonly an Excel spreadsheet, and record the info there. Not too bad, after all there are only a handful of partitions on a slide (2, 8, 16) and you only run the instrument once or twice a week. If you never upgrade your instrument, or never try and push too many samples through, then you're fine. Of course the less you run your instrument the more your data cost and the goal is to get really good at running your instrument, as frequently as possible. Otherwise you look bad at audit time.

Let's look at a scenario where the instrument is being run at maximal throughput. Over the course of a year, data from between 200 and 1000 slide lanes (chambers) may be collected. These data may be associated with 100's or 1000's of samples and belong to a few or many users in one or many lab groups. The relevant sequence files range from a few hundred megabytes to gigabytes in size; they exist in directories with run quality metrics and possibly analysis results. To sort this out you could have committee meetings to determine whether data should be organized by sample, experiment, user, or group, or you could just pick an organization. Once you've decided on your organization you have to set up access. Does everyone get a unix account? Do you set up SAMBA services? Do you put the data on other systems like Macs and PCs? What if people want to share? The decisions and IT details are endless. Regardless, you'll need a battery of scripts to automate moving data around to meet your organizational scheme. Or you could do something easier.

Organizing data the Finch way

One of FinchLab's many strengths is how it organizes Next Gen data. Because the system tracks samples and users, and has group and permissions models, issues related to data access and sharing are simplified. After a run is complete, the system knows which data files go with which samples. It also knows which samples were submitted by each user. Thus data can be maintained in the run directories that were created by the instrument software to simplify file-based organization. When a run is complete in FinchLab a data link is made to the run directory. The data link informs the system which files go with a run. Data processing routines in the system sort the data into sequences, quality metric files, and other data. At this stage data are associated with samples. Once this is done, the lab has easy access to the data via web pages. The lab can also make decisions about access to data and how to analyze the data. These last two features make FinchLab a powerful system for core labs and research groups. With only a few clicks your data are organized by run, user, group, and experiment - and you didn't have to think about it.
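The bookkeeping described here amounts to joining a run sheet (which sample sat in which lane, and who submitted it) against the files the instrument wrote. The following is a rough sketch, not FinchLab's actual data model: the file naming pattern `s_<lane>_...`, the `RUN_SHEET` layout, and all sample and user names are assumptions made for the example.

```python
import re
from collections import defaultdict

# Hypothetical run sheet: lane -> (sample, user).
RUN_SHEET = {
    1: ("liver_rna", "alice"),
    2: ("liver_rna", "alice"),   # one sample spread over two lanes
    3: ("chip_input", "bob"),
}

# Assumed naming convention: files start with "s_<lane>_".
LANE_PATTERN = re.compile(r"^s_(\d+)_")


def link_files(filenames, run_sheet):
    """Group file names by sample and by user using the lane in the name."""
    by_sample, by_user = defaultdict(list), defaultdict(list)
    unlinked = []
    for name in filenames:
        match = LANE_PATTERN.match(name)
        lane = int(match.group(1)) if match else None
        if lane not in run_sheet:
            unlinked.append(name)  # e.g. run-level metrics files
            continue
        sample, user = run_sheet[lane]
        by_sample[sample].append(name)
        by_user[user].append(name)
    return dict(by_sample), dict(by_user), unlinked


files = ["s_1_sequence.txt", "s_2_sequence.txt", "s_3_sequence.txt",
         "s_3_quality.txt", "run_summary.htm"]
by_sample, by_user, unlinked = link_files(files, RUN_SHEET)
```

Note that the files never move: the same run directory can be viewed by sample, by user, or by run simply by building a different index over it.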




Thursday, June 5, 2008

Finishing in the Future

"The data sets are astronomical," "the data that needs to be attached to sequences is unbelievable," and "browsing [data] is incomprehensible." These are just three of the many quotes I heard about the challenges associated with DNA sequencing last week at the "Finishing in the Future Meeting" sponsored by the Joint Genome Institute (JGI) and Los Alamos National Laboratory (LANL).

Metagenomics

The two and a half day conference, focused on finishing genomic sequences, kicked off with a session on metagenomics. Metagenomics is about isolating DNA from environments and sequencing random molecules to "see what's out there." Excitement for metagenomics is being driven by Next Gen sequencing throughput, because so many sequences can be collected relatively inexpensively. A benefit of being able to collect such large data sets is that we can interrogate organisms that cannot be cultured. The first talk, "Defining the Human Microbiome: Friends or Family," was presented by Bruce Birren from the Broad Institute of MIT & Harvard. In this talk, we learned about the HMP (Human Microbiome Project), a project dedicated to characterizing the microbes that live on our bodies. It is estimated that microbial cells outnumber our cells by ten to one. It has long been speculated that our microbiomes are involved in our health and sickness and recent studies are confirming these ideas.

Sequencing technologies continue to increase data throughput

The afternoon session opened with presentations from Roche (454), Illumina, and Applied Biosystems on their respective Next Gen sequencing platforms. Each company presented the strengths of their platform and new discoveries that are being made by virtue of having a lot of data. Each company also presented improvements and road maps designed to produce even more data. As Haley Fiske from Illumina put it, "we're in the middle of an arms race!" Finally, all the companies are working on molecular barcodes, so that multiple samples can be analyzed within an experiment. So, we started with a lot of data from a sample and are going to a lot of data from a lot of samples. That should add some very nice complexity to sample and data tracking.
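To see why barcodes complicate tracking, consider what has to happen to the reads afterwards: each one must be assigned back to its sample. A toy Python sketch of that step follows. The 4-base barcodes and sample names are made up, real kits define their own sequences, and real demultiplexing would also have to handle sequencing errors in the barcode.

```python
from collections import defaultdict

# Made-up 4-base barcodes; real kits define their own sequences.
BARCODES = {"ACGT": "sample_A", "TGCA": "sample_B"}


def demultiplex(reads, barcodes, length=4):
    """Assign each read to a sample by its leading barcode, then trim it."""
    bins = defaultdict(list)
    for read in reads:
        sample = barcodes.get(read[:length], "unassigned")
        bins[sample].append(read[length:] if sample != "unassigned" else read)
    return dict(bins)


bins = demultiplex(["ACGTGGGTTTAA", "TGCACCCATTAG", "NNNNACGTACGT"], BARCODES)
```

Every lane now needs its own barcode-to-sample table, which is exactly the kind of information that has to be captured before the run.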

A unique perspective

Sydney Brenner opened the second day with a talk on "The Unfinished Genome." The thing I like most about a Sydney Brenner talk is how he puts ideas together. In this talk he presented how one could look at existing data and literature to figure things out or make new discoveries. In one example, he speculated on when the genes for eye development may have first appeared. From the physiology of the eye you can use the biochemistry of vision to identify the genes that encode the various proteins involved in the process. These proteins are often involved in other processes, but differ slightly. They arise from gene duplication and modification. So, you can look at gene duplications and measure the age of a duplication by looking at neighboring genes. If a duplication event is old, neighboring genes will be unequal distances apart. You can use this information, along with phylogenetic data, to estimate when the events occurred. Of course this kind of study benefits from more sequence data. Sydney encouraged everyone to keep sequencing.

Sydney closed his talk by making a fun analogy where genomics is like astronomy and thus should have been called "genomy." He supported his analogy by noting that astronomy has astrophysics and genomics has genetics. Both are quantitative and measure history and evolution. Astronomy also has astrology, the prediction of an individual's future from the stars. Similarly, folks would like to predict an individual's future from their genes, and he suggested we call this work "Genology," since it has the same kind of scientific foundation as astrology.

Challenges and solutions

The rest of the conference and posters focused on finishing projects. Today the genome centers are making use of all the platforms to generate large data sets and finish projects. A challenge for genomics is lowering finishing costs. The problem is that generating "draft" data has become so inexpensive and fast that finishing has become a significant bottleneck. Finishing is needed to produce the high quality reference sequences that will inform our genomic science, so investigating ways to lower finishing costs is a worthwhile endeavour. Genome centers are approaching this problem by looking at ways to mix data from different technologies such as 454 and Illumina or SOLiD. They are also developing new and mixed software approaches such as combining multiple assembly algorithms to improve alignments. These efforts are being conducted in conjunction with experiments where mixtures of single pass and paired read data sets are tested to determine optimal approaches for closing gaps.

The take home from this meeting is that, over the coming years, a multitude of new approaches and software programs will emerge to enable genome scale science. The current technology providers are aggressively working to increase data throughput, data quality and read length to make their platforms as flexible as possible. New technology providers are making progress on even higher throughput platforms. Computer scientists are working hard on new algorithms and data visualizations to handle the data. Molecular barcodes will allow for greater numbers of samples per data collection event and increase sample tracking complexity.

The bottom line

Individual research groups will continue to have increasing access to "genome center scale" technology. However, the challenges with sample tracking, data management, and data analysis will be daunting. Research groups with interesting problems will be cut off from these technologies unless they have access to cost-effective, robust informatics infrastructures. They will need help setting up their labs, organizing the data, and making use of new and emerging software technologies.

That's where Geospiza can help.


Monday, May 26, 2008

Finch 3: Managing Workflows

Genetic analysis workflows begin with RNA or DNA samples and end with results. In between, multiple lab procedures and steps are used to transform materials, move samples between containers, and collect the data. Each kind of data collected and each data collection platform requires that different laboratory procedures are followed. When we analyze the procedures, we can identify common elements. A large number of unique workflows can be created by assembling these elements in different ways.

In the last post, we learned about the FinchLab order form builder and some of its features for developing different kinds of interfaces for entering sample information. Three factors contribute to the power of Finch orders. First, labs can create unique entry forms by selecting items like pull down menus, check boxes, radio buttons, and text entry fields for numbers or text, from a web page. No programming is needed. Second, for core labs with business needs, the form fields can be linked to diverse price lists. Third, and the subject of this post, the forms are also linked to different kinds of workflows.

What are Workflows?

A workflow is a series of steps that must be performed to complete a task. In genetic analysis, there are two kinds of workflows: those that involve laboratory work, and those that involve data processing and analysis. The laboratory workflows prepare sample materials so that data can be collected. For example, in gene expression studies, RNA is extracted from a source material (cells, tissue, bacteria), and converted to cDNA for sequencing. The workflow steps may involve purification, quality analysis on agarose gels, concentration measurements, and reactions where materials are further prepared for additional steps.

The data workflows encompass all the steps involved in tracking, processing, managing, and analyzing data. Sequence data are processed by programs to create assemblies and alignments that are edited or interrogated to create genomic sequences, discover variation, understand gene expression, or perform other activities. Other kinds of data workflows, such as microarray analysis or genotyping, involve developing and comparing data sets to gain insights. Data workflows involve file manipulations, program control, and databases. The challenge for the scientist today, and the focus of Geospiza's software development, is to bring the laboratory and data workflows together.

Workflow Systems

Workflows can be managed or unmanaged. Whether you work at the bench or work with files and software, you use a workflow any time you carry out a procedure with more than one step. Perhaps you write the steps in your notebook, check them off as you go, and tape in additional data like spectrophotometer readings or photos. Perhaps you write papers in Word and format the bibliography with Endnote or resize photos with Photoshop before adding them to a blog post. In all these cases you performed unmanaged workflows.

Managing and tracking workflows becomes important as the number of activities and number of individuals performing them increase in scale. Imagine your lab bench procedures performed multiple times a day with different individuals operating particular steps. This scenario occurs in core labs that perform the same set of processes over and over again. You can still track steps on paper, but it's not long before the system becomes difficult to manage. It takes too much time to write and compile all of the notes, and it's hard to know which materials have reached which step. Once a system goes beyond the work of a single person, paper notes quit providing the right kinds of overviews. You now need to manage your workflows and track them with a software system.

A good workflow system allows you to define the steps in your protocols. It will provide interfaces to move samples through the steps and also provide ways to add information to the system as steps are completed. If the system is well-designed, it will not allow you to do things at inappropriate times or require too much "thinking" as the system is operated. A well-designed system will also reduce complexity and allow you to build workflows through software interfaces. Good systems give scientists the ability to manage their work; they do not require their users to learn arcane programming tools or resort to custom programming. Finally, the system will be flexible enough to let you create as many workflows as you need for different kinds of experiments and link those workflows to data entry forms so that the right kind of information is available to the right process.

FinchLab Workflows

The Geospiza FinchLab workflow system meets the above requirements. The system has a high level workflow that understands that some processes require little tracking (a quick test) and others require more significant tracking ("I want to store and reuse DNA samples"). More detailed processes are assigned workflows that consist of three parts: a name, a "State," and a "Status." The "State" controls the software interfaces and determines which information is presented and accessed at different parts of a process. A sequencing or genotyping reaction, for example, cannot be added to a data collection instrument until it is "ready." The other part specifies the steps of the process. The steps of the process (Statuses) are defined by the lab and added to a workflow using the web interfaces. When a workflow is created, it is given a name, as many steps as needed, and it is assigned a State. The workflows are then assigned to different kinds of items so that the system always knows what to do next with the samples that enter.
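One way to picture a workflow made of a name, a State, and an ordered list of Statuses is the small Python model below. It is an illustration only. The classes, the state names other than "ready," and the example steps are hypothetical and do not describe FinchLab's internals.

```python
from dataclasses import dataclass


@dataclass
class Workflow:
    """A named list of lab-defined steps plus the State reached at the end."""
    name: str
    statuses: list          # ordered steps (Statuses) defined by the lab
    final_state: str        # State assigned once every step is complete


@dataclass
class Item:
    """A sample or reaction moving through a workflow."""
    label: str
    workflow: Workflow
    step: int = 0
    state: str = "processing"

    def advance(self) -> str:
        """Complete the current step; switch State after the last one."""
        if self.state != "processing":
            raise ValueError(f"{self.label} is already {self.state}")
        self.step += 1
        if self.step == len(self.workflow.statuses):
            self.state = self.workflow.final_state
        return self.state

    def can_load_on_instrument(self) -> bool:
        # The State, not the individual step, gates the interface.
        return self.state == "ready"


prep = Workflow("cDNA library prep",
                ["RNA extraction", "Gel QC", "cDNA synthesis"], "ready")
sample = Item("sample-001", prep)
states = [sample.advance() for _ in prep.statuses]
```

Separating the lab-defined steps from the State keeps the gating logic generic: the lab can add or rename steps without changing the rule that only "ready" items may be placed on an instrument.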

A workflow management system like FinchLab makes it just as easy to track the steps of Sanger DNA sequencing as it is to track the steps of a Solexa, SOLiD, or 454 sequencing process. You can also, in the same system, run genotyping assays and other kinds of genetic analysis like microarrays and bead assays.


Next time, we'll talk about what happens in the lab.


Tuesday, May 20, 2008

Finch 3: Defining the Experimental Information

In today's genetic analysis laboratory, multiple instruments are used to collect a variety of data ranging from DNA sequences to individual values that measure DNA (or RNA) hybridization, nucleotide incorporations, or other binding events. Next Gen sequencing adds to this complexity and offers additional challenges with the amount of data that can be produced for a given experiment.

In the last post, I defined basic requirements for a complete laboratory and data management system in the context of setting up a Next Gen sequencing lab. To review, I stated that laboratory workflow systems need to perform the following basic functions:
  1. Allow you to set up different interfaces to collect experimental information
  2. Assign specific workflows to experiments
  3. Track the workflow steps in the laboratory
  4. Prepare samples for data collection runs
  5. Link data from the runs back to the original samples
  6. Process data according to the needs of the experiment
I also added that if you operate a core lab, you'll want to bill for your services and get paid.

In this post I'm going to focus on the first step, collecting experimental information. For this exercise let's say we work in a lab that has:
  • One Illumina Solexa Genome Analyzer
  • One Applied Biosystems SOLiD System
  • One Illumina Bead Array station
  • Two Applied Biosystems 3730 Genetic Analyzers, used for both sequencing and fragment analysis


This image shows our laboratory home page. We run our lab as a service lab. For each data collection platform we need to collect different kinds of sample information. One kind of information is the sample container. Our customers' samples will be sent to the lab in many different kinds of containers depending on the kind of experiment. Next Gen sequencing platforms like SOLiD, Solexa, and 454 are low throughput with respect to sample preparation, so samples will be sent to us in tubes. Instruments like the Bead Array and 3730 DNA sequencing instrument usually involve sets of samples in 96 or 384 well plates. In some cases, samples start in tubes and end up in plates, so you'll need to determine which procedures use tubes and which use plates and how the samples will enter the lab.

Once the samples have reached the lab and been checked, you are also going to do different things to them in order to prepare them for the different data collection platforms. You'll want to know which samples should go to which platforms, and have the workflows for the different processes defined so that they are easy to follow and track. You might even want to track and reuse certain custom reagents like DNA primers, probes, and reagent kits. In some cases you'll want to know physical information, like the sample type (DNA or RNA) or concentration, up front. In other cases you'll determine that information later.

Finally, let's say you work at an institution that focuses on a specific area of research, like cancer, mouse genetics, or plant research. In these settings you might also want to track information about the sample source. Such information could include species, strain, tissue, treatment, or many other kinds of things. If you want to explore this information later, you'll probably want to define a vocabulary that can be "read" by computer programs. To ensure that the vocabulary is followed, interfaces will be needed to enter this information without typing, or else you'll have a problem like pseudomonas, psuedomonas, or psudomonas.
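As a rough illustration of why a controlled vocabulary helps, here is a minimal Python sketch that accepts only terms from a predefined list. The vocabulary contents and the `validate_term` function are hypothetical and are not taken from FinchLab; a real system would more likely offer these terms in a pick list so nothing is typed at all.

```python
# Hypothetical controlled vocabulary; the terms are illustrative only.
SPECIES_VOCABULARY = {"Pseudomonas aeruginosa", "Mus musculus", "Homo sapiens"}


def validate_term(value, vocabulary):
    """Return the canonical spelling of value, or raise ValueError.

    Matching ignores case and surrounding whitespace, but a misspelled
    term is rejected rather than silently stored.
    """
    canonical = {term.lower(): term for term in vocabulary}
    key = value.strip().lower()
    if key not in canonical:
        raise ValueError(f"{value!r} is not in the controlled vocabulary")
    return canonical[key]
```

With this check in place, "psuedomonas aeruginosa" is refused at entry time instead of becoming a second, unsearchable spelling in the database.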

Information systems that support the above scenarios have to deal with a lot of "sometimes this" and "sometimes that" kinds of information. If one path is taken, say Sanger sequencing on a 3730, the sample information and physical configurations needed are different from those needed for Next Gen sequencing. Next Gen platforms have different sample requirements too. SOLiD and 454 require emulsion PCR to prepare sequencing samples, whereas Solexa amplifies DNA molecules on slides in clusters. Additionally, the information entry system has to deal with "I care" and "I don't care" kinds of data, like information about sample sources or experimental conditions. These kinds of information are needed later to understand the data in the context of the experiment, but do not have much impact on the data collection processes.

How would you create a system to support these diverse and changing requirements?

One way to do this would be to build a form with many fields and rules for filling it out. You know those kinds of forms. They say things like "ignore this section if you've filled out this other section." That would be a bad way to do this, because no one would really get things right, and the people tasked with doing the work would spend a lot of time either asking questions about what they are supposed to be doing with the samples or answering questions about how to fill out the form.

Another way would be to tell people that their work is too complex and they need custom solutions for everything they do. That's expensive.

A better way to do this would be to build a system for creating forms. In this system, different forms are created by the people who develop the different services. The forms are linked to workflows (lab procedures) that can understand sample configurations (plates, tubes, premixed ingredients, and required information). If the system is really good, you can easily create new forms and add fields to them to collect physical information (sample type, concentration) or experimental information (tissue, species, strain, treatment, your mother's maiden name, favorite vacation spot, ...) without having to develop requirements with programmers and have them build forms. If your system is exceptionally good, smart, and clever, it will let you create different kinds of forms and fields and prevent you from doing things that are in direct conflict with one another. If your system is modern, it will be 100% web-based and have cool web 2.0 features like automated fill downs, column highlighting, and multi-selection devices so that entering data is easy, intuitive, and even a bit fun.
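To make the "system for creating forms" idea concrete, here is a small Python sketch in which forms are data rather than hand-built screens. The `Field` and `Form` classes, their attributes, and the example field names are all assumptions made for illustration; they do not describe how FinchLab is actually implemented.

```python
from dataclasses import dataclass, field


@dataclass
class Field:
    """One entry on an order form (hypothetical structure)."""
    name: str
    required: bool = False
    choices: tuple = ()  # an empty tuple means free text is allowed


@dataclass
class Form:
    """A form tied to one workflow and one sample container type."""
    name: str
    workflow: str
    container: str
    fields: list = field(default_factory=list)

    def add_field(self, new_field):
        # Refuse a conflicting definition: two fields with the same name.
        if any(f.name == new_field.name for f in self.fields):
            raise ValueError(f"field {new_field.name!r} already exists")
        self.fields.append(new_field)

    def validate(self, entry):
        """Return a list of problems found in one submitted entry."""
        problems = []
        for f in self.fields:
            value = entry.get(f.name)
            if value in (None, ""):
                if f.required:
                    problems.append(f"missing required field: {f.name}")
                continue
            if f.choices and value not in f.choices:
                problems.append(f"{f.name}: {value!r} is not an allowed choice")
        return problems


# Example: a service owner defines a form without writing any new code.
solid_form = Form("SOLiD fragment library", workflow="emulsion PCR", container="tube")
solid_form.add_field(Field("sample_type", required=True, choices=("DNA", "RNA")))
solid_form.add_field(Field("concentration"))
solid_form.add_field(Field("species"))
```

Required fields cover the "I care" data that the workflow depends on, while optional fields such as species cover the "I don't care" data that only matters later. Adding a new service means creating another `Form` instance, not another program.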

FinchLab, built on the Finch 3 platform, is such a system.


Tuesday, May 6, 2008

The Next Gen Sequencing Lab

Illumina's "Genome Center in a mailroom" message really captures the impact of next generation sequencing technology. Each Illumina Genome Analyzer, AB SOLiD instrument, or Roche Genome Sequencer (454) has the per-run capacity of a genome center's daily output. More importantly, this is possible because you can do your DNA prep work on a single lab bench. Of course, you'll have to find someplace to put the data.

In the old days (last year), if you wanted to collect data on a genome center scale, you not only had to have a large warehouse with hundreds of capillary electrophoresis genetic analyzers, you also had to have multiple large rooms devoted to sample preparation. In the largest genome centers, one full room is used to prepare DNA libraries, another is used to purify DNA templates, and finally a large space is needed to run the sequencing reactions (we're not even talking about media, autoclaves, and other support). Multiple robots are required to pick bacterial colonies, transfer liquids between 384-well plates, and aliquot purified DNA, primers, and enzyme/nucleotide cocktails. To support these activities, a small army of technicians works to set up the materials, move plates through the process, and load the instruments. This is all tracked by a custom LIMS (Laboratory Information Management System) and a team of developers who keep it running and develop tools to process the data.

With Next Gen sequencing all of this is replaced by a mailroom, laboratory bench, and a couple of people.

While you can make do with less space, fewer people, and less robotics and custom LIMS software, you do need to track what is happening at the bench. You are probably also going to want to know which of those many thousands of files go with which samples. Today's Next Gen systems allow you to partition your sequencing materials into slide chambers (also called lanes and sections) to give between 8 and 32 separate data sets per run. To track samples and lab workflows, and to link data and results together, you will need a software system that can perform the following basic functions:
  1. Allow you to set up different interfaces to collect experimental information
  2. Assign specific workflows to experiments
  3. Track the workflow steps in the laboratory
  4. Prepare samples for data collection runs
  5. Link data from the runs back to the original samples
  6. Process data according to the needs of the experiment
And if you are a core lab you'll likely want to set up experiments as services and create billing statements for the work.
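Function 5, linking run data back to the original samples, is largely a bookkeeping problem. The Python sketch below shows one way to picture it, assuming each data file can be identified by a run ID and a lane number. The `link_run_files` function, the IDs, and the tuple layout are hypothetical and do not reflect any instrument's real output format.

```python
def link_run_files(lane_to_sample, run_files):
    """Group data files by sample.

    lane_to_sample maps (run_id, lane) to a sample ID, as recorded when
    the run was set up. run_files is an iterable of (run_id, lane, path)
    tuples. Files whose lane was never assigned are returned separately
    so they can be reviewed instead of being silently dropped.
    """
    by_sample = {}
    unassigned = []
    for run_id, lane, path in run_files:
        sample = lane_to_sample.get((run_id, lane))
        if sample is None:
            unassigned.append(path)
        else:
            by_sample.setdefault(sample, []).append(path)
    return by_sample, unassigned
```

The key point is that the lane-to-sample assignment has to be captured at the bench, when the run is prepared. Once it exists, attaching thousands of files to the right samples is a simple lookup.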

Traditionally, this kind of system was only possible through custom software development: either you did it yourself or you worked with a company to build the features that were needed. Now you can get this support in a software product that is quick to deploy and can be configured to your needs. Over the coming weeks and months I'll show you how this can be done with the Geospiza FinchLab. If you want to know now, give us a call.