Dated: 4/1/2006
New tools help biologists integrate complex datasets.
Today’s burgeoning field of systems biology takes researchers away from traditional one-gene-at-a-time bench experiments in favor of combining technologies from fields such as genomics, mass spectrometry, imaging, and informatics to advance their understanding of biological systems.
The advent of high-throughput (HTP) technologies, such as transcriptomics (microarrays) and proteomics, has been fueling a revolution in biology by enabling systems-level analysis. These HTP approaches are especially promising for characterizing biomolecules at a global scale; however, the large, heterogeneous data sources make interpretation especially challenging. Microarrays, a method used to investigate the expression levels of thousands of genes simultaneously, produce data with very high dimensionality and considerable variability from experiment to experiment. Proteomics, the study of protein expression patterns in organisms, is a vast, complex field that requires powerful separation methods, mass spectrometers, and advanced algorithms to automate data processing.
The lack of computational capabilities to analyze bulky datasets from HTP techniques is a bottleneck for this new era of biology research. As a consequence, many investigators who conduct experiments that generate massive amounts of data find few options for analyzing the data; therefore, the full potential of the studies is never realized. Combining HTP measurements in a given experiment may provide more information, but may also exacerbate the problem. A global challenge for biology is integrating HTP data into computational models that predict cell response.
“There’s a need for heterogeneous data. Most analytical technologies provide only a single dimension of data,” says Steven Wiley, Director of Systems Biology at the U.S. Department of Energy’s Pacific Northwest National Laboratory (PNNL) in Richland, WA. “Complex processes involve multi-step transformations, and you can’t understand a process without being able to obtain and integrate data on all of its dimensions.”
One example is a change in protein expression. There are multiple levels at which protein production can be controlled, not to mention post-translational modifications that often dictate protein function. To understand how the abundance and activity of a given protein are regulated, a biologist has to be able to analyze changes in the different forms of the molecules over various time scales.
But data analysis doesn’t end with analyzing the numbers. Effective visualizations can help scientists explore and interpret data. Indeed, scientists in many domains have come to trust and rely on graphical representations of large datasets. Humans can perceive patterns that statistical and visualization methods reveal. Such patterns suggest relationships in the data that a biology expert can use. Visualization technologies simplify the way we analyze massive amounts of complex data from multiple sources.
Biology is now compute- and data-intensive by nature. PNNL is home to the Proteomics Research Resource for Integrative Biology, funded by the National Center for Research Resources (NCRR) component of the National Institutes of Health. PNNL scientists are generating terabyte-scale data and know first-hand the need for new ways to manage it.
According to Dick Smith, Director of the NCRR and an international expert in high-resolution mass spectrometry, “Even in these early days of proteomics, our ability to generate data significantly exceeds our ability to effectively use all of the data. Investments in this area will become even more important over the next few years as further increases in measurement throughput are realized and, rather than being drowned by the flood of data, we are better able to deal with the complexity of biological systems.”
PNNL is developing bioinformatics/data management tools to archive, manage and analyze biological data. Most of these tools will be publicly available at no charge. PNNL is also a global leader in the field of information analytics, housing the National Visualization and Analytics Center (NVAC), which provides national strategic leadership for visual analytics technology. The Laboratory’s combined expertise in computational and biological sciences creates an ideal environment to realize the promise of systems biology with state-of-the-art technology and software tools.

Bioinformatics resource management
PNNL is developing a systems biology computational framework using a problem-solving environment called Bioinformatics Resource Manager (BRM) that automates routine data-merging and data-mining methods, and provides seamless linkage to data visualization tools [1]. This will allow scientists to efficiently analyze massive amounts of data. The BRM is designed to be operated by biologists looking for greater insights into their experimental results. It retrieves data to the desktop, making data integration easier across technologies and across experiments. This is essential for prioritizing follow-on experiments. Linkage to data visualization tools facilitates discovery through human perception of patterns within the data. Finally, the BRM can format data according to specified requirements so that it can be easily exported to commercial tools or shared with collaborators.
The overall extensible framework of the BRM is facilitated by a multi-tiered, structured architecture that provides a component-based capability for data resources and analytical tools (Figure 1). This architecture is designed such that a few basic components are built and deployed initially to a user base, while new functionality is continually added through the development of new components. The current system provides access to internal data sources, such as microarray and proteomic data, as well as data from the National Center for Biotechnology Information, and the Kyoto Encyclopedia of Genes and Genomes.
BRM uses Remote Method Invocation to launch several external data analysis tools, including NCBI’s PubMed, the Conserved Domain Architecture Retrieval Tool (CDART), and the Basic Local Alignment Search Tool (BLAST). It also links to Cytoscape, a publicly available network visualization tool [2], and many analysis tools that are being developed at PNNL, such as PQuad, described in more detail below [3].
Designed for biologists
The ability to move and merge data within BRM is especially useful for the data integration needs of systems biology. Multiple datasets are merged by overlapping gene and protein sequence identifiers, which traditionally requires several cross-referencing tables or annotation retrieval tools. Because of inconsistencies in public databases, BRM compares datasets using a combination of gene identifiers, such as RefSeq, UniGene, and gene symbols. Then, data available in public databases, including biological function (process and pathway) and subcellular localization data, are retrieved and merged into the final dataset. Finally, integrated analyses can be performed on this merged dataset. For example, gene expression can be compared with protein abundance measurements within an experiment. Functional information can then be clustered with expression data to visualize patterns on a pathway level.
Although the very first step of matching gene and protein identifiers sounds simple, it is surprisingly non-trivial and is frequently the most severe bottleneck in the entire process. Challenges include inconsistency in gene and protein annotation across species, redundancy in probe sets on microarrays, uncertainty in protein identifications, and the lack of tools for cross-referencing gene and protein identifiers through batch retrieval for large datasets. BRM is meeting these challenges.
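The identifier-matching step can be sketched in a few lines of Python. This is only an illustration of the general idea of trying several identifier types in turn, not BRM's implementation: the record layout, the field names (refseq, unigene, symbol), the function names, and the sample values are all assumptions made up for the example.

```python
# Hypothetical sketch: pair proteomics records with microarray records by
# trying several identifier types in order of preference. Not BRM's code.

ID_PRIORITY = ("refseq", "unigene", "symbol")  # assumed field names


def build_index(records, id_fields=ID_PRIORITY):
    """Map (field, normalized identifier) -> record for every identifier present."""
    index = {}
    for rec in records:
        for field in id_fields:
            value = rec.get(field)
            if value:
                # setdefault keeps the first record seen for a duplicated identifier
                index.setdefault((field, value.strip().upper()), rec)
    return index


def merge_by_identifiers(expression, proteins, id_fields=ID_PRIORITY):
    """Return (merged rows, unmatched protein records).

    Identifiers are tried in priority order, so a record with no RefSeq
    accession can still match on a UniGene cluster or a gene symbol.
    """
    index = build_index(expression, id_fields)
    merged, unmatched = [], []
    for prot in proteins:
        match, matched_on = None, None
        for field in id_fields:
            value = prot.get(field)
            if value:
                match = index.get((field, value.strip().upper()))
                if match is not None:
                    matched_on = field
                    break
        if match is None:
            unmatched.append(prot)
        else:
            merged.append({
                "symbol": match.get("symbol") or prot.get("symbol"),
                "matched_on": matched_on,
                "expression": match["expression"],
                "abundance": prot["abundance"],
            })
    return merged, unmatched


# Placeholder identifiers and values, invented for illustration only.
expression = [
    {"refseq": "NM_0001", "symbol": "GENEA", "expression": 2.1},
    {"unigene": "Hs.0002", "symbol": "GENEB", "expression": -0.7},
]
proteins = [
    {"refseq": "nm_0001", "abundance": 1.4},   # matches on RefSeq despite case
    {"symbol": "geneb", "abundance": 0.3},     # falls back to gene symbol
    {"symbol": "GENEX", "abundance": 0.9},     # no counterpart on the array
]
merged, unmatched = merge_by_identifiers(expression, proteins)
```

Keeping the unmatched records, rather than silently dropping them, makes the size of the annotation gap visible before any integrated analysis is run.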
BRM meets a host of biological data requirements. As shown in Figure 2, it defines and develops data services to gather data from microarray, proteomic and cross-reference data sources; and it provides analytical tools that fuse multiple data streams within a coherent visual environment to support hypothesis-driven research. BRM can provide connectivity to targeted genomic, proteomic, metabolomic and annotation data from private and community sources.
PQuad visualization tool
PQuad is a novel proteomics visualization tool that combines DNA or RNA sequence and proteomics data at multiple resolutions. It enables visualization of the data at the chromosome, gene, and sequence levels. The linked interactive views of mass spectrometry-based proteomics data allow the user to view empirical evidence of peptides (and therefore proteins) in the context of the genome and proposed open reading frame (ORF) location. PQuad also allows comparative analysis of proteomics data, with a Venn diagram feature that displays unique and common proteins across treatments or experiments.
PQuad provides three key levels of detail necessary for the analysis of peptides and proteins from prokaryotes (Figure 3). The Genome View gives a bird’s eye view of the proteomic data at the chromosome or plasmid level. The Sequence View allows the user to drill down to the individual nucleotides and amino acids of interest. Finally, the Protein View allows the user to observe the data in the context of the six-frame translation of the DNA or RNA strand.
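For readers unfamiliar with the term, a six-frame translation reads a nucleotide sequence in three codon offsets on the forward strand and three on the reverse complement. The sketch below shows the general technique using the standard genetic code; the function names and frame labels are illustrative choices and are not taken from PQuad.

```python
# Minimal six-frame translation sketch using the standard genetic code.
# Names and frame labels ("+1" .. "-3") are illustrative, not PQuad's.
from itertools import product

BASES = "TCAG"
# Amino acids for the 64 codons in TCAG order (TTT, TTC, TTA, TTG, TCT, ...).
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(dna):
    """Complement each base, then reverse the strand."""
    return dna.translate(COMPLEMENT)[::-1]


def translate(dna):
    """Translate codon by codon; '*' marks a stop, 'X' an unrecognized codon."""
    return "".join(
        CODON_TABLE.get(dna[i:i + 3], "X")
        for i in range(0, len(dna) - 2, 3)
    )


def six_frame_translation(sequence):
    """Return a dict of the three forward and three reverse reading frames."""
    dna = sequence.upper().replace("U", "T")  # accept RNA input as well
    frames = {}
    for strand, seq in (("+", dna), ("-", reverse_complement(dna))):
        for offset in range(3):
            frames[f"{strand}{offset + 1}"] = translate(seq[offset:])
    return frames


frames = six_frame_translation("ATGGCCATTGTAATGGGCCGCTGAAAGGGTGCCCGATAG")
print(frames["+1"])  # MAIVMGR*KGAR*
```

Observed peptides can then be searched against each of the six translated frames, which is what allows peptide evidence to be placed on the genome independently of predicted gene boundaries.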
The ability to interact with proteomics data in a visualization environment can be valuable, as dealing with large, static spreadsheets of information is time-consuming and difficult. Also, to infer meaningful information from the data, a researcher must be able to view the data in the context of additional biological information such as gene expression or networks.

The Protein View of PQuad displays all six reading frames of a section of DNA, and the associated proteins (yellow bars) with the observed peptides (red bars) overlaid (see Figure 3). Functional information about proteins can be encoded in color. A simple visual query of the data will reveal information such as protein expression and the confidence in those predictions as a measure of protein coverage by peptides. From a biological perspective, finding numerous peptides in the same reading frame (i.e., on the same horizontal line) with similar functions could indicate an operon, which is a controllable unit of gene regulation.
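One simple way to quantify coverage of a protein by peptides is the fraction of its residues that fall inside at least one observed peptide. The article does not specify how PQuad computes this, so the sketch below is an assumption: it uses exact substring matching, and the function name and the sample sequences are invented for illustration.

```python
# Hypothetical sketch: sequence coverage as the fraction of protein residues
# covered by at least one observed peptide (exact substring matches only).

def sequence_coverage(protein, peptides):
    """Return the covered fraction of the protein, between 0.0 and 1.0."""
    if not protein:
        return 0.0
    covered = [False] * len(protein)
    for peptide in peptides:
        if not peptide:
            continue
        # Mark every occurrence, including overlapping repeats.
        start = protein.find(peptide)
        while start != -1:
            for i in range(start, start + len(peptide)):
                covered[i] = True
            start = protein.find(peptide, start + 1)
    return sum(covered) / len(protein)


# Made-up protein and peptide sequences.
protein = "MKTAYIAKQRQISFVKSHFSRQ"
observed = ["TAYIAK", "ISFVK"]
coverage = sequence_coverage(protein, observed)
print(f"{coverage:.0%}")  # 50%
```

Overlapping peptides are counted once per residue, so heavily sampled regions do not inflate the figure.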
The PNNL team has built several unique capabilities into PQuad, such as visualization of proteomic data in the context of protein function, comparative proteomics (highlighted differences across multiple datasets), advanced query capabilities allowing the selection of peptides/proteins of interest based on user-driven requirements (such as peptide quality metrics), and linkage to network-style visualizations using Cytoscape. Figure 4 gives an example of how the network visualization is linked to the comparative proteomics capabilities; it displays common and unique peptides for three different conditions. PQuad can currently analyze proteins from microbes, but the development team is modifying the tool to work with more complex cell systems as well as with animal models.
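The comparison behind a Venn-style display of common and unique peptides reduces to set arithmetic. The following sketch assumes each condition is simply a set of peptide identifiers; the function name, condition names, and identifiers are hypothetical, and nothing here reflects how PQuad stores or draws its data.

```python
# Hypothetical sketch: compute the regions of a Venn diagram, i.e. the items
# found in exactly a given combination of conditions and in no others.
from itertools import combinations


def venn_regions(datasets):
    """Map frozenset(condition names) -> items exclusive to that combination."""
    names = list(datasets)
    regions = {}
    for size in range(1, len(names) + 1):
        for group in combinations(names, size):
            inside = set.intersection(*(set(datasets[n]) for n in group))
            outside = set().union(
                *(set(datasets[n]) for n in names if n not in group)
            )
            exclusive = inside - outside
            if exclusive:  # skip empty regions
                regions[frozenset(group)] = exclusive
    return regions


# Made-up peptide identifiers for three conditions.
datasets = {
    "control": {"P1", "P2", "P3"},
    "heat": {"P2", "P3", "P4"},
    "cold": {"P3", "P5"},
}
regions = venn_regions(datasets)
for group, items in regions.items():
    print(sorted(group), sorted(items))
```

Every item lands in exactly one region, so the region sizes sum to the number of distinct peptides across all conditions.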
Preparing for peta-scale data
Advancements in genomics and proteomics technologies have enabled scientists to collect massive amounts of data, taking us closer to solving complex biological problems. Major strides have been made in the field of computational biology, not just in developing statistical tools, but also in integrating metadata sources and data storage in a query-compatible environment. The exponential growth in the amount of data collected in research, however, has created an urgent technical challenge for computer scientists.
“We are approaching a peta-scale computing challenge,” says George Michaels, Associate Laboratory Director for PNNL’s Computational and Information Sciences Directorate. “From a computing standpoint, technology currently can’t manage the large-scale and data-intensive enterprises.”
Data-intensive computing is not just an evolutionary change in informatics, but is a revolutionary change in the way information is gathered and processed, from the hardware and algorithms that are used, to the presentation of knowledge to the end user. Scientists at PNNL are working to combine the power of supercomputers with advanced algorithms and novel parallel computing architectures optimized for peta-scale memory access. On the software side, new scalable data analysis tools are being developed for discovering patterns in large heterogeneous datasets and for integrating data across different sources and spatial and temporal scales. These advances will be crucial to the success of systems biology and the development of computational models of cells and organ systems.

As for the future, biology will begin to seem less like a descriptive science driven by experiments and more like a predictive science, driven by data analysis and large-scale simulations. These changes will not come quickly, but will come as an inevitable consequence of our desire to control the quality of our health and environment. Systems biology is clearly part of our future, and powerful computers and software systems will be essential for its success.
References
1. Singhal M, EG Stephan, KR Klicker, LL Trease, G Chin, Jr., DK Gracio, and DA Payne. 2004. “Enabling systems biology: a scientific problem-solving environment.” In Computational Science - ICCS 2004: 4th International Conference, Kraków, Poland, June 6-9, 2004, Proceedings, Part II 3037:540-547. Springer-Verlag, Berlin, Germany.
2. Shannon P, Markiel A, Ozier O, Baliga NS, Wang JT, Ramage D, Amin N, Schwikowski B, Ideker T. 2003. “Cytoscape: a software environment for integrated models of biomolecular interaction networks.” Genome Research 13(11):2498-2504.
3. Havre SL, M Singhal, DA Payne, MS Lipton, and BM Webb-Robertson. 2005. “Enabling proteomics discovery through visual analysis.” IEEE Engineering in Medicine and Biology Magazine 24(3):50-57.

Katrina M. Waters is a senior research scientist in the Computational Biology and Bioinformatics groups, Mudita Singhal is a bioinformatics and information visualization researcher, Bobbie-Jo M. Webb-Robertson specializes in statistical analysis and tool development, Eric G. Stephan specializes in data modeling and database architecture, and Julie M. Gephart is a senior writer assigned to the Lab’s Systems Biology program and Biological Sciences Division, all at PNNL. They may be contacted at sceditor@scimag.com.