Appendix 2

Research Opportunities in Computational Biology

Executive Summary

Computational Biology is emerging as a discipline in its own right, much as molecular biology did a generation ago. In the late 1950s and early 1960s a group of scientists began to apply the tools of several disciplines (genetics, microbiology, physics, biochemistry, and biophysics) to analyze biological problems in a new way. The power of this approach was so great that it emerged as a discipline in itself and is now known as molecular biology. As described in this document, the application of mathematical and computational tools to all areas of biology is producing equally exciting results, is providing insights into biological problems too complex for traditional analysis, and is emerging as a new discipline within the biological sciences.

There is a consensus among observers that biology, regardless of sub-speciality, is overwhelmed with a large amount of very complex data. What sets biology apart from other data-rich fields, however, is the complexity rather than the sheer volume of the data produced. In contrast to other data-rich fields, biology remains a scientific "cottage industry": data generation is done in a highly distributed mode, with no standard format or syntax.

Thus, all areas of the biological sciences have an urgent need for organized and accessible storage of biological data. This is generally referred to as biological database development; however, that terminology implies traditional database technology such as transaction-oriented relational database systems. Unfortunately, relational database technology is inadequate for many areas of the biological sciences because of the complexity of biological data and the absence of a standardized data structure. It is clear that collaboration between computer scientists and biologists will be necessary to design information platforms that accommodate the variation in representations of biological data, the distributed nature of data acquisition, the variable demands placed on different data sets, and the absence of adequate algorithms for data comparison, which forms the basis of biological science.

There have been dramatic advances in commercially available hardware over the past few years, with effects at both the high and low ends of the spectrum. In the past this general-purpose hardware was inadequate to address the most computationally intense problems in the biological sciences; those problems were best handled by special-purpose equipment designed by teams of biologists and chip and circuit designers. This condition has been dramatically altered in the past two years as high-performance general-purpose instruments have become more widely available. Hardware limitations are not the only constraint on the productivity of the computational biologist. There is a continuing need for new algorithm development for many tasks, especially comparisons between objects and images. Imaging technology is central to almost all of biology, and data representation through image construction remains an elusive but astoundingly powerful tool. The full utilization of modern CAD tools in computational biology will advance image analysis, but will require intense software and hardware development because of the complexity of biological data.

During the last decade there were dramatic advances in instrumentation and related methodologies for both light and electron microscopy. The advances lie not simply in higher resolution, but in a broader size range of structures that can be analyzed, more powerful methods for putting together the pieces of three-dimensional puzzles of cell form, and the addition of dynamic details of biological form and function, ranging from the subcellular to the physiological level. The new approaches are computationally demanding. Extant computational resources, which were typically set up for entirely different processing needs, are not surprisingly proving inadequate for dealing with the massive data flow. An effort to develop new computational approaches is underway in a few laboratories around the world. However, it is important that new software be developed within the context of the experimental research driving the needs; that is, there must be close collaboration between those developing the software and the groups carrying out research on static and dynamic structures. Furthermore, augmentation of the experimental environment, particularly image processing equipment and other specialized equipment, is needed. Positions for sophisticated programmers are even more important. A prime example of the need for such a laboratory-based specialized programming effort is the development of workstations for interactive visualization and interpretation of 3-D data. The development will proceed in pace with experimental research only if it is done in an environment "open" in the terms used by the computer science world, where new applications are developed free from proprietary restraints and distributed as source code to other laboratories facing the same experimental needs. Commercial interests or specialized production groups will be required, finally, to add value to the baseline development, producing highly reliable ("bullet proof") production-line products.

X-ray crystallography and NMR are the major experimental methods for deducing macromolecular structures at atomic resolution. Both produce extremely large amounts of data and are entirely dependent upon the availability of powerful computers and sophisticated processing algorithms for the interpretation of raw data. In addition, there are fundamental scientific problems in both areas that require major computational advances, and substantial opportunities exist for combining structural information from several experimental techniques. This may provide the basis for a structural solution where only partial data are available from any single technique. With improved computational tools, combining physical data from a variety of sources may become commonplace. These developments will allow solutions to be obtained for structural problems which would otherwise be intractable. Analysis of errors in structures based upon experimental data from several sources also represents a new computational challenge.

Advances in X-ray and NMR data analysis will lead directly to rapid developments in the field of protein folding which will be synergistic with developments in other areas of biology, and especially computational biology. Common problems of data representation, search strategy, pattern recognition and data visualization appear in many fields. There is a particularly exciting synergistic relationship between the protein folding field and those of structure determination by X-ray crystallography and 2-D NMR. Each field will benefit from rapid advances in the other disciplines. Improved folding algorithms provide a new way to attack the phase problem in crystallography, and new, more carefully refined protein structures provide rich new insights into protein folding.

Various initiatives in computational neurobiology give us the hope of interpreting the mass of anatomical and physiological information about the nervous system that is now available in functional terms. Better interpretation of these data will permit neurobiology to make contact with other fields such as psychology and artificial intelligence. This work will make specific, testable predictions in the areas of sensory perception (visual, olfactory, and auditory), memory, learning, and motor control. Above all, it will lead to the integration of all these aspects to provide an eventual understanding of the total functioning of the nervous system. Such integration can be expected to provide new insights that will lead to improvements in the treatment of diseases of the nervous system at all levels, from neuropharmacology to psychotherapy. In addition, studies of this kind may be expected to contribute to major advances in artificial intelligence and practical robotics.

In the area of genome analysis significant progress has been made over the past few years, including the use of molecular tools such as Restriction Fragment Length Polymorphism (RFLP) analysis. However, considerable effort is still required to make genetic linkage maps effective tools for genetic research. To be useful in common situations, more markers must be identified and mapped to produce higher-resolution maps. In many cases marker analysis requires the ability to analyze small families and consider quantitative traits. To be fully useful in a quantitative sense, this analysis will require powerful computer simulation and modeling. Common to all of the problem areas examined is the need for good visualization of data. Visualization is necessary because the sequence analysis phase for a molecular biologist is equivalent to exploratory analysis for a statistician. It is at this point that the experimentalist gains the feeling for, and understanding of, a sequence which may then guide many months of experimental work. The complexity inherent in biological systems is so great that very sophisticated methods of analysis are required. These are the tools which must be readily accessible to molecular and cellular biologists untrained in computer technology.

Ecology and evolutionary biology encompass a broad range of levels of biological organization, from the organism through the population to communities and whole ecosystems. This complexity demands computational solutions. The need for enhanced computational ability is most evident when one attempts to couple large numbers of individual units into highly interactive and largely parallel networks, whether at the tissue, community or ecosystem level of organization. The proliferation of information from remote sensing introduces the need for geographical information systems that provide a framework for classifying information, spatial statistics for analyzing patterns, and dynamic simulation models that allow the integration of information across multiple spatial, temporal, and organizational scales. Today, application software in these fields is mostly nonexistent except in a few special cases such as image processing and remote sensing. As more researchers begin to use computational techniques, we can expect to see a wider sharing of applications developed by an individual or small group. This will require additional resources to take research codes and make them "bullet-proof" enough for community use and to add adequate documentation. To take advantage of all these new capabilities, we need to increase training modalities. This can take a wide variety of forms, from on-line self-training techniques to special sessions at universities, national centers, or workshops.

It is recommended that Federal granting agencies place greater emphasis on the area of Computational Biology through a number of mechanisms. This support must be developed over a period of several years, with particular emphasis on infrastructure and training. Many of the necessary changes may be instituted immediately, while others will require a longer time in order to generate budgetary resources to build in new areas. The current focus on biological databases is a good beginning; however, the need is so great that the initiative requires considerable additional resources. These resources should be directed to three areas. First, the enhancement of current databases which are in wide use but need a concerted effort at standardization of data structures and broadened access. Second, a continued examination of new databases which will incorporate important information needed by many investigators, but also explore new database ideas and representations. Third, research on the representation of objects and images which will be searchable and comparable within database structures. For example, there is a great need to be able to search a database of enzyme or antibody active site configurations to test for binding of newly developed ligands. Database development remains the highest-priority item, since this area is common to all fields of biology. A second area of high priority is the development of more powerful visualization tools for data interpretation. This area too is a need shared by almost all fields of biology. Funding agencies could immediately respond to some of the needs of the research community by recognizing the need for professional programmers and hardware and software facilities on grants in this area.
Agencies must break out of the habit of immediately removing these items from budget requests in order to reduce the overall cost of an award, since these items are critical not only to doing the proposed work but also to making the results of the work (in the form of usable source code) available to the rest of the research community.

