Bioinformatics glossary
Evolving terminology for emerging technologies
Suggestions? Comments? Questions? mchitty@healthtech.com
Knowledge from many scientific disciplines and their subfields has to be integrated to achieve the goals of bioinformatics. The heterogeneity [of contributing fields] inhibits integration, often because their terminologies differ. We may wish to overcome the problems of heterogeneity by having standards. ... But standards require stability. Yesterday’s technological innovations have become today’s infrastructure. ... Effective recall on the world- wide- web is limited by a flood of irrelevant, obsolete, and even wrong information. ["Bioinformatics: Converting Data to Knowledge" Gio Wiederhold introductory remarks National Research Council workshop Feb. 16, 2000 Washington DC, US]  http://www-db.stanford.edu/pub/gio/2000/NRC2001.htm

Related glossaries include Applications: Drug Discovery & Development, Functional genomics, Sequencing. Informatics: Algorithms & data management, Chemoinformatics, Computers & computing, Databases & software directory, Molecular Modeling, Biology: Expression, Sequences, DNA & beyond  Organizations appear in the In-depth glossary, after the Bibliography.

annotation: The annotation process identifies sequence features on the contigs - such as variation, sequence tagged sites, FISH mapped clone regions, known and predicted genes, and gene models. This stage provides contig, mRNA, and protein records with added feature annotation. [NCBI Contig Assembly and Annotation Process, 2001]  http://www.ncbi.nlm.nih.gov/genome/guide/build.html#contig 

The elucidation and description of biologically relevant features in the sequence is essential in order for genome data to be useful. The quality with which annotation is done will have direct impact on the value of the sequence. At a minimum, the data must be annotated to indicate the existence of gene coding regions and control regions. Further annotation activities that add value to a genome include finding simple and complex repeats, characterizing the organization of promoters and gene families, the distribution of G + C content, and tying together evidence for functional motifs and homologs. [Lawrence Berkeley Lab, US "Advanced Computational Structural Genomics"]  http://cbcg.lbl.gov/ssi-csb/Meso.html
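
One of the value-adding annotations listed above, the distribution of G + C content, is easy to illustrate. The following is a minimal sketch, not any group's actual pipeline; the function names, window size and toy sequence are assumptions made for illustration.

```python
def gc_content(seq: str) -> float:
    """Fraction of G and C bases in a DNA sequence (0.0 for empty input)."""
    seq = seq.upper()
    if not seq:
        return 0.0
    return (seq.count("G") + seq.count("C")) / len(seq)


def gc_windows(seq: str, window: int, step: int) -> list[tuple[int, float]]:
    """G + C content in sliding windows, as (start offset, GC fraction) pairs."""
    return [
        (start, gc_content(seq[start:start + window]))
        for start in range(0, len(seq) - window + 1, step)
    ]


# Toy sequence, invented for illustration only.
toy = "ATGCGCGCATATATATGGCC"
print(gc_content(toy))                      # overall G + C fraction
print(gc_windows(toy, window=10, step=5))   # how it varies along the sequence
```

On a real contig the window would typically span thousands of bases, and the resulting profile would be stored as a feature alongside the sequence record.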

Explanatory notes, comments, analysis and commentaries added to a database. May refer to sequence data or protein structures and includes predictions, characterizations, summaries, and other detailed information, including gene function. Annotation can be manual (as in SWISS- PROT) or automated (as in TrEMBL).  Since annotation is highly skilled and labor intensive, efforts are being made to automate the process, at least for preliminary data. Related terms: curated databases; Genetic Variations glossary. In-depth: Genetic Annotation Initiative. Narrower terms: annotation - proteins, comparative genome annotation, genome annotation.

annotation- proteins: Proteomics glossary

bioinformatics: In order for genomics to provide novel and validated targets and provide the basis for personalized and predictive medicine, biological pathways will need to be mapped and understood. Significant challenges are to be met in understanding the pathways that exist in gene regulation and the expression and utility of proteins. More robust computational methods are required to analyze gene expression data for higher accuracy and predictive value. SNP data are being investigated as a way to assess genetic variation in population studies and to develop personalized medicine. Protein structure modeling through ab initio and homology methods will be important to facilitate the functional annotation of genes and to aid rational drug design. [Integrative Bioinformatics: High-Throughput Interpretation of Pathways and Biology, Jan. 16-18, 2002, Zurich, Switzerland]

Genomics and proteomics are continuing to ramp up the speed with which new data is generated on expression, structure, interaction and function. Each of these types of data presents challenges in terms of interpretation, but the biggest challenge lies in finding better ways to integrate different types of data. This is particularly true as more researchers take a systems biology approach, looking for comprehensive scope. Another huge challenge is to find efficient ways to cut through the noise of variability in comprehensive data to focus in on what is most relevant to a specific disease, pathway or therapeutic target. This conference will focus on the tools needed to systematically link different types of datasets, and annotate the experimental data with clinical information. [Bioinformatics: Beyond genome, June 4-5, 2002, San Diego, CA]

Research, development or application of computational tools and approaches for expanding the use of biological, medical, behavioral or health data, including those to acquire, store, organize, archive, analyze, or visualize such data. ["Bioinformatics at the NIH", 2001]  http://grants.nih.gov/grants/bistic/bistic.cfm

The discipline of storing, retrieving, analyzing, and integrating biological data. The field currently encompasses protein structure analysis, gene and protein functional information, data from patients, pre- clinical and clinical trial information, and studies of metabolic pathways in numerous species.  Bioinformatics will be one of the keys to success for companies applying genomic tools to drug discovery and development.  Demand for greater flexibility, better integration, and higher- value analytical tools is increasing. As a result, a growing number of companies are competing in this field, with a wider range of offerings and business models. During this, the "functional" and "high- throughput" phase of genomics, having top- level software products is simply not enough. The most promising contenders offer not just excellent applications but also access to databases and/ or consulting services.  [CHI Bioinformatics] 

Study of information content and information flow in biological systems and processes … the bridge between observations (data) in diverse biologically- related disciplines and the derivations of understanding (information) about how the systems or processes function … A more pragmatic definition in the case of diseases is the understanding of dysfunction (diagnostics) and the subsequent applications of the knowledge for therapeutics and prognosis. [Hwa Lim, D'Trends, Inc. to Mary Chitty, personal communication,  Jan 2000]

We have coined the term Bioinformatics for the study of informatic processes in biotic systems. Our Bioinformatic approach typically involves spatial, multi- leveled models with many interacting entities whose behavior is determined by local information. [Theoretical Biology Group, Univ. of Utrecht, Netherlands, Paulien Hogeweg Director]  http://www-binf.bio.uu.nl/

Original definition was “the study of informatic processes in biotic systems” [Paulien Hogeweg MIRROR beyond MIRROR, puddles of LIFE, in Artificial Life, ed. C.G. Langton, Addison Wesley, 297-316, 1988]

Related term data mining. Algorithms & data management glossary

biomedical computing: Computers & computing glossary

biosemiotics: http://www.gypsymoth.ento.vt.edu/~sharov/biosem/biosem.html#topics

CORBA: Computers & computing glossary

comparative genome annotation:  The major immediate interests of the genome projects are in the identification of protein coding regions. However, a complete description of gene structure necessitates identification of the associated sites which signal the different processes in the gene to protein pathway. Such sites include promoters, transcription start and end points, poly-adenylation sites, splice sites, and translation start and stop sites. In addition, regulatory regions form an important functional component of gene structure. Indeed, gene regulation may utilise alternatives in promoters, splice sites and translation start sites. Accurate identification of coding regions is aided by the identification of such sites, and vice versa. Identification of regulatory sites is more accurate when they are viewed in the context of other surrounding elements.  ["Briefings in Bioinformatics" special issue, proceedings from the symposium on "Genome Based Gene Structure Determination" conducted at the EMBL European Bioinformatics Institute (EBI) during June 1-2, 2000] http://industry.ebi.ac.uk/~thanaraj/BIB_Editorial.htm
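
To show how translation start and stop sites bound a candidate coding region, here is a deliberately naive sketch that scans the forward strand for an ATG followed by an in-frame stop codon. It ignores splice sites, promoters, the reverse strand and every other signal mentioned above, so it should be read as a toy and not as the method used by any genome project; the function name, the `min_codons` parameter and the test sequence are assumptions.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def find_orfs(seq: str, min_codons: int = 2) -> list[tuple[int, int]]:
    """Return (start, end) offsets of forward-strand ATG...stop stretches.

    `end` is exclusive and includes the stop codon. `min_codons` is the
    minimum number of codons before the stop, counting the ATG itself.
    Only the first ATG of each open stretch in a frame is reported.
    """
    seq = seq.upper()
    orfs = []
    for frame in range(3):
        start = None
        for i in range(frame, len(seq) - 2, 3):
            codon = seq[i:i + 3]
            if start is None and codon == "ATG":
                start = i                      # candidate translation start
            elif start is not None and codon in STOP_CODONS:
                if (i - start) // 3 >= min_codons:
                    orfs.append((start, i + 3))
                start = None                   # look for the next start site
    return sorted(orfs)


# Invented sequence: ATG AAA TTT TAG sits in frame 2.
print(find_orfs("CCATGAAATTTTAGGG"))
```

Because each kind of site constrains where the others can be, practical gene finders score many such signals jointly instead of scanning for them one at a time.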

computational biology: The development and application of data-analytical and theoretical methods, mathematical modelling and computational simulation techniques to the study of biological, behavioral, and social systems. ["Bioinformatics at the NIH", 2001]  http://grants.nih.gov/grants/bistic/bistic.cfm

 A field of biology concerned with the development of techniques for the collection and manipulation of  biological data, and the use of such data to make biological discoveries or predictions. This field encompasses all computational methods and theories applicable to molecular biology and areas of computer- based techniques for solving biological problems including manipulation of models and datasets.  [MeSH]

controlled vocabulary: Computers & computing glossary 

curated databases: Often less complete than primary databases, but they have less redundancy and the added value of scientific annotation; therefore, a biologically significant sequence should be easier to find in such a database and of greater value. Naturally, the degree of redundancy and annotation in such a database depends on the experience, skills, aims, and devotion of its curators.  ...  The only proper way to curate databases is the way groups like those that developed OMIM, SWISS- PROT and most commercial databases have done it—that is, through making scientific judgments as data are cleaned up and merged. [CHI Bioinformatics]

Under the supervision of a curator. Examples of curated databases are LocusLink, OMIM (Online Mendelian Inheritance in Man), RefSeq, SGD (Saccharomyces cerevisiae Genome Database) and SWISS- PROT.

data mining: Algorithms & data management glossary

databases: Collections of data in machine-readable form, which can be manipulated by software to appear in varying arrangements and subsets. [CHI Bioinformatics] 

Genetic information is stored in different ways in different databases, which makes it hard to compare their holdings. So while computational biologists are trying to improve the quality of the databases, they are also working to build bridges between them.  So far, they have had only limited success … each database has its own Web site with unique navigation tools and data storage formats that make such searching difficult … programs can’t easily recognize data that are not stored in a uniform way. [Elizabeth Pennisi “Seeking Common language in a Tower of Babel” Science: 449 Oct. 15 1999]   

Databases & software directory describes and provides links to around 200 databases and about 30 software tools. Narrower terms include annotated databases, curated databases, federated databases, integrated databases, interoperability, non- redundant databases, proprietary databases, redundant databases, relational databases. In-depth flat files, indexed flat files.

federated databases: An integrated repository of data from multiple, possibly heterogeneous, data sources presented with consistent and coherent semantics. They do not usually contain any summary data, and all of the data resides only at the data source (i.e. no local storage).   [Lawrence Berkeley Lab "Advanced Computational Structural Genomics" Glossary] http://cbcg.lbl.gov/ssi-csb/Meso.html

federated information systems: Their main characteristic is that they are constructed as an integrating layer over existing legacy applications and databases. They can be broadly classified in three dimensions: the degree of autonomy they allow in integrated components, the degree of heterogeneity between components they can cope with, and whether or not they support distribution. Whereas the communication and interoperation problem has come into a stage of applicable solutions over the past decade, semantic data integration has not become similarly clear. [Susanne Busse et al. "Federated Information Systems: Concepts, Terminology and Architecture"  Computergestützte Informations Systeme CIS, Berlin, Germany 1999] http://citeseer.nj.nec.com/busse99federated.html
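
The federated idea described above, an integrating layer that forwards each query to the underlying sources and keeps no local copy, can be sketched as follows. Both sources, their field names and their records are hypothetical stand-ins for remote databases; this is an illustration under those assumptions, not a description of any existing federated system.

```python
from typing import Callable, Iterable

# Each "source" is a function that answers a query at call time.
# The sources and their record layouts are hypothetical.
Record = dict[str, str]
Source = Callable[[str], Iterable[Record]]


def source_a(gene: str) -> Iterable[Record]:
    data = {"ADH": [{"gene_symbol": "ADH", "organism": "D. melanogaster"}]}
    return data.get(gene, [])


def source_b(gene: str) -> Iterable[Record]:
    data = {"ADH": [{"symbol": "ADH", "desc": "alcohol dehydrogenase"}]}
    # Translate this source's own field names into the shared vocabulary,
    # so that callers see consistent semantics.
    return [{"gene_symbol": r["symbol"], "description": r["desc"]}
            for r in data.get(gene, [])]


def federated_query(gene: str, sources: list[Source]) -> list[Record]:
    """Forward the query to every source; nothing is stored locally."""
    results: list[Record] = []
    for source in sources:
        results.extend(source(gene))
    return results


print(federated_query("ADH", [source_a, source_b]))
```

The per-source translation step is where the semantic integration problem noted above shows up: agreeing on what `gene_symbol` means across sources is harder than moving the data.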

functional genomic data: Functional genomics glossary

Gene Ontology™ (GO): Functional genomics glossary. Broader term ontology.

genome annotation: It is now apparent that the bottleneck in genomics is no longer in sequencing the genomes, but lies in their annotation. Large-scale annotation efforts require handling massive amounts of genome data through automated pipelines, with a need to combine diverse sources of data and methods. In addition, it requires visualisation tools to manually examine the automatic annotation, since integration of human expertise to assess the validity and authenticity of all computational results goes a long way to improve the quality of gene annotation. The "Annotation Jamboree", a collaboration between Celera, the Berkeley Drosophila Genome Project, and a team of experts on the annotation of the Adh region of Drosophila, is an exemplary attempt on how to transform the process of manual annotation into a high- throughput operation. [Paradigm Shifts in the Approaches for Gene Annotation, a special issue of "Briefings in Bioinformatics" which reports on the proceedings from the recently concluded symposium on "Genome Based Gene Structure Determination" conducted at the EMBL European Bioinformatics Institute (EBI) during June 1-2, 2000.]  http://industry.ebi.ac.uk/~thanaraj/BIB_Editorial.htm. Related terms ant analysis"

genomic data: Genomics glossary

integrated databases: Integration [of databases] typically is accomplished by creating small, object-oriented software elements, or “wrappers” that let a single overlaying, often browser like, desktop application interact with all the pieces.  The original separate systems are intact and functional, and new ones can be added, while the underlying complexity is transparent to users. There are still many challenges … but computing environments are becoming more unified, flexible and expandable. [A. Thayer “Bioinformatics for the Masses” Chemical & Engineering News 78(6): 19-32 Feb. 7, 2000] See also Gene Ontology; In- depth Bio-Ontology Standards Group, Data Model Standards Group,  In-depth Bioinformatics glossary
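
A rough sketch of the wrapper approach: each separate system gets a small object with a common interface, and a single overlaying application talks only to that interface. All class names, identifiers and data below are hypothetical, chosen only to show the shape of the design.

```python
from abc import ABC, abstractmethod


class SourceWrapper(ABC):
    """Common interface the overlaying application talks to."""

    @abstractmethod
    def lookup(self, identifier: str) -> dict:
        """Return whatever this source knows about `identifier`."""


class SequenceStoreWrapper(SourceWrapper):
    def __init__(self) -> None:
        self._rows = {"X1": "ATGAAATAG"}   # stands in for a separate system

    def lookup(self, identifier: str) -> dict:
        seq = self._rows.get(identifier)
        return {} if seq is None else {"sequence": seq}


class LiteratureStoreWrapper(SourceWrapper):
    def __init__(self) -> None:
        self._refs = {"X1": ["ref-001"]}   # stands in for another system

    def lookup(self, identifier: str) -> dict:
        refs = self._refs.get(identifier)
        return {} if refs is None else {"references": refs}


class Desktop:
    """Overlaying application: users see one lookup, not the pieces."""

    def __init__(self) -> None:
        self._wrappers: list[SourceWrapper] = []

    def register(self, wrapper: SourceWrapper) -> None:
        self._wrappers.append(wrapper)     # new systems can be added later

    def lookup(self, identifier: str) -> dict:
        merged: dict = {}
        for wrapper in self._wrappers:
            merged.update(wrapper.lookup(identifier))
        return merged


desktop = Desktop()
desktop.register(SequenceStoreWrapper())
desktop.register(LiteratureStoreWrapper())
print(desktop.lookup("X1"))
```

The original systems stay intact behind their wrappers, and adding a source means writing one more small class and registering it.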

Information in OMIM [Online Mendelian Inheritance in Man] and the published working draft of the International Human Genome Sequencing Consortium (Nature 15 Feb. 2001) has been facilitated by ties to NCBI's RefSeq and LocusLink databases. Are there other good examples of integrated databases?

integration (of databases): Allows researchers to increase the value they get from the data, because it increases the base of information they can access and allows for more robust searching. [CHI Bioinformatics]  Related terms include middleware, Object Oriented modeling OOM, In-depth object protocol model OPM; Maps genomic & genetic memory mapped data structures

interoperability: Computers & computing glossary

memory-mapped data structures: In this approach [to data- level integration without semantic cleaning] subsets of data from various sources are collected, normalized, and integrated in memory for quick access. While this approach performs actual data integration and addresses the problem of  poor performance in the federated approach, it requires additional calls to traditional relational databases to integrate descriptive data. While data cleaning is being performed on some of the data sources, it is not being done across all sources or in the same place. This makes it difficult to quickly add new data sources. [Approaches to Integrating Biological Data, K. Griffiths, R. Resnick, NetGenics, Inc. Intelligent Systems in Molecular Biology, August 19-23, 2000 La Jolla CA, US]  http://ismb2000.sdsc.edu/tutorials/griffiths.html
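
As a hedged illustration of this approach, the sketch below pulls small subsets from two hypothetical sources, normalizes their identifiers, and merges them into one in-memory index, while descriptive text stays in a relational database and needs a further call. The source layouts, the `normalize_id` rule and the table name are all assumptions made for the example; an in-memory SQLite database stands in for the traditional relational store.

```python
import sqlite3

# Hypothetical subsets pulled from two sources with different conventions.
source_one = [{"id": "gene1 ", "expr": "2.5"}]
source_two = [{"ID": "GENE1", "chromosome": "2L"}]


def normalize_id(raw: str) -> str:
    """Normalization rule assumed for this example: trim and upper-case."""
    return raw.strip().upper()


def build_index(*batches):
    """Merge (key, fields) pairs from every batch into one in-memory dict."""
    index: dict[str, dict] = {}
    for batch in batches:
        for key, fields in batch:
            index.setdefault(key, {}).update(fields)
    return index


batch_one = [(normalize_id(r["id"]), {"expression": float(r["expr"])})
             for r in source_one]
batch_two = [(normalize_id(r["ID"]), {"chromosome": r["chromosome"]})
             for r in source_two]
index = build_index(batch_one, batch_two)

# Descriptive data is not held in memory; it stays in a relational database.
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE descriptions (id TEXT PRIMARY KEY, text TEXT)")
db.execute("INSERT INTO descriptions VALUES (?, ?)", ("GENE1", "toy description"))


def describe(gene_id: str) -> dict:
    """Fast in-memory lookup plus one extra call for the descriptive text."""
    row = db.execute(
        "SELECT text FROM descriptions WHERE id = ?", (gene_id,)
    ).fetchone()
    return dict(index.get(gene_id, {}), description=row[0] if row else None)


print(describe("GENE1"))
```

Each new source needs its own normalization code before it can join the index, which is one way to read the remark that new data sources are hard to add quickly.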

metadata: Computers & computing glossary 

middleware: Computers & computing glossary

modularity: Ensures that, for the particular task at hand, the data will be collected and stored in an appropriate manner - which differs greatly from one level of activity (simply gathering the raw data) to another (storing analyzed data) and from one type of high- throughput system to another. ... The best system is one that employs integration at those levels where it is an advantage but maintains enough modularity to ensure that (1) there are no major compromises regarding how any one type of data is handled and, (2) all the key elements in a researcher’s information system can be adjusted or updated independently. [CHI Bioinformatics] Related term integration, interoperability.

molecular bioinformatics: Conceptualizing biology in terms of molecules (in the sense of physical- chemistry) and then applying "informatics" techniques (derived from disciplines such as applied math, CS [computer science] and statistics) to understand and organize the information associated with these molecules on a large- scale. [Mark Gerstein "What is Bioinformatics?" MB&B 474b3, 2001] http://bioinfo.mbb.yale.edu/what-is-it.html

non-redundant databases: Researchers at the National Center for Biotechnology Information (NCBI) coined the term "nr" database (nonredundant database) to refer to a database in which the obviously redundant entries have been merged. These entries are typically those that are 100%, character-by-character identical, and algorithms exist that can remove such redundancy. Although such a database has less redundancy than a primary database, a substantial amount of redundancy remains, and it can be removed only by a curator using scientific judgment. [CHI Bioinformatics]
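
Removing entries that are 100%, character-by-character identical is the part that an algorithm can do. A minimal sketch, assuming entries arrive as (accession, sequence) pairs; the accessions and sequences are invented, and this is not NCBI's actual procedure.

```python
def merge_identical(entries):
    """Merge entries whose sequences are character-by-character identical.

    `entries` is an iterable of (accession, sequence) pairs. Returns a dict
    mapping each distinct sequence to the accessions that carried it.
    """
    merged: dict[str, list[str]] = {}
    for accession, sequence in entries:
        merged.setdefault(sequence, []).append(accession)
    return merged


# Invented records: A1 and B7 are exact duplicates, C3 differs by one base.
entries = [("A1", "ATGGCC"), ("B7", "ATGGCC"), ("C3", "ATGGCA")]
print(merge_identical(entries))
```

The near-duplicate C3 survives the merge. Deciding whether it is a variant, an error or a distinct sequence is the kind of judgment the definition above leaves to a curator.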

Many databases try to be “non-redundant”.  Unfortunately, biological data is too complex to fit a simple definition of redundancy … Each “non- redundant” database has its own definition of redundancy. [George Church Lab, Harvard Medical School, US]  http://arep.med.harvard.edu/seqanal/db.html   Examples of non- redundant databases include UniGene and SWISS- PROT, while DDBJ/ EMBL/ GenBank are redundant databases.

OMG Object Management Group: Computers & computing glossary

Object- oriented modeling OOM: Computers & computing glossary

ontology: Computers & computing glossary

proprietary databases:  Fee- based, copyrighted databases (in contrast to public databases such as those at DDBJ/ EMBL/ GenBank).  Examples include Incyte's LifeSeq and Gene Logic's GeneExpress databases.  Some databases charge subscription fees to commercial organizations, with other arrangements available to non- profits.

redundant databases: When sequence databanks were first created, primary [redundant] databases had the advantage of being more comprehensive than curated databases and more likely to contain recently discovered sequences. However, redundancy is no longer much of an advantage. In a highly redundant database, biologically significant results are more likely to be hidden among large numbers of irrelevant reported matches. [CHI Bioinformatics] Related term non- redundant databases

relational databases: Most or all of the data are structured. These files are the hardest to set up and maintain, and require specific knowledge by a searcher, but they are the easiest to use when doing analysis or integration. Data is categorized by specific fields, and so, by knowing the fields one should be able to capture all the relevant data, quite easily. The searchability of a relational database is totally dependent on how well the database has been structured. [CHI Bioinformatics]
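
To make "knowing the fields" concrete, here is a small sketch using Python's built-in `sqlite3` module. The table, its fields and its rows are invented for illustration.

```python
import sqlite3

db = sqlite3.connect(":memory:")

# The structure (schema) has to be designed up front.
db.execute(
    """CREATE TABLE genes (
           symbol     TEXT PRIMARY KEY,
           organism   TEXT NOT NULL,
           chromosome TEXT NOT NULL
       )"""
)
db.executemany(
    "INSERT INTO genes VALUES (?, ?, ?)",
    [("geneA", "human", "1"), ("geneB", "human", "X"), ("geneC", "mouse", "1")],
)

# A searcher who knows the fields can capture exactly the relevant rows.
rows = db.execute(
    "SELECT symbol FROM genes WHERE organism = ? AND chromosome = ? "
    "ORDER BY symbol",
    ("human", "1"),
).fetchall()
print(rows)
```

The query is only as good as the table design: if organism and chromosome had been lumped into one free-text field, the same question could not be asked this precisely.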

schema (plural schemata): A description of the data represented within a database. The format of the description varies but includes a table layout for a relational database or an entity- relationship diagram. [Lawrence Berkeley Lab "Advanced Computational Structural Genomics" Glossary] Narrower term global schema. Algorithms & data management glossary http://cbcg.lbl.gov/ssi-csb/Meso.html#anchor597905

standards: Related terms CORBA. In- depth Bio-ontology Standards Group, Data Model Standards Group, object protocol model OPM. EBI is also working on standards.

structural bioinformatics: Structural genomics glossary

taxonomies: Computers & computing glossary

throughput: Bioinformatics is currently undergoing dramatic changes, as high- throughput laboratory methods lead to changes in key approaches, including sequence analysis, gene expression analysis, protein expression analysis, and protein structure prediction and modeling. [CHI Bioinformatics Report] http://www.chireports.com/pressrelease/bioinformatics.asp

 Gene expression monitoring has provided valuable insight into biological mechanisms. In the future, efforts to map all proteins and identify interactions will likewise provide useful clues to biologists. Computational methods are needed to accomplish this in high throughput, without requiring the identification of individual genes but rather gene pathways and regulatory circuits. The complexity of a biological system can be reproduced by carefully observing the patterns of gene and protein expression and correlating these patterns with factors that cause perturbation. [CHI Integrated Bioinformatics Jan. 2001, Zurich Switzerland] http://www.healthtech.com/2001/bne/  Related terms Functional genomics glossary; systems biology Structural genomics glossary structural proteomics.

Bibliography

[CHI Bioinformatics]: Getting Results in the Era of High- Throughput Genomics, April 2001. http://www.chireports.com/content/reports/bioinformatics.asp

In-depth Bioinformatics glossary

BISTI:  Biomedical Information Science and Technology Initiative NIH Working Group on Biomedical Computing made recommendations in their June 3, 1999 report on creating a national bioinformatics infrastructure. BISTI recommendations, April 2000 http://grants.nih.gov/grants/bistic/bisti_recommendations.cfm

BISTI Consortium: Established in May 2000 to serve as the focus of biomedical computing issues at the NIH and to facilitate implementation of the BISTI recommendations. The Consortium is composed of senior-level representatives from the NIH centers and institutes and representatives of other Federal agencies concerned with bioinformatics and computational applications. The mission of the BISTI Consortium is to make optimal use of computer science and technology to address problems in biology and medicine by fostering new basic understandings, collaborations, and transdisciplinary initiatives between the computational and biomedical sciences. http://grants.nih.gov/grants/bistic/bistic2.cfm

BSML Bioinformatic Sequence Markup Language: An XML application from Visual Genomics. It attempts to address the problems of comparing genetic data from multiple sources and platforms for effective management, communication and interactive visualization of bioinformatic data. [Online Journal of Bioinformatics 1(1): 1-13, 2000]   http://www.cpb.uokhsc.edu/ojvr/xmlpaper.html

biocorba.org: Provides an object-oriented, language-neutral, platform-independent method for describing and solving bioinformatic problems. BioCORBA's mission is to leverage the code of the other Bio projects in a simple and easy to use fashion. For example, the language-neutral environment allows users to write programs using BioPython and access BioPerl modules through the CORBA server. http://www.biocorba.org/

biojava.org: An open-source project dedicated to providing Java tools for processing biological data. This will include objects for manipulating sequences, file parsers, CORBA interoperability, access to ACeDB, dynamic programming, and simple statistical routines.  The BioJava library is useful for automating those daily and mundane bioinformatics tasks. http://www.biojava.org/

Bio-ontology Standards Group: There is currently an effort underway to standardize domain-specific ontologies and vocabularies to support interoperability of data and software components. [Approaches to Integrating Biological Data, K. Griffiths, R. Resnick, NetGenics, Inc. Intelligent Systems in Molecular Biology ISMB August 19-23, 2000 La Jolla CA, US] http://ismb2000.sdsc.edu/tutorials/griffiths.html

bioperl.org: An international association of developers of open source Perl tools for bioinformatics, genomics and life science research. We work closely with our friends and colleagues at biojava.org, biopython.org and bioxml.org. The Bioperl server provides an online resource for modules, scripts, and web links for developers of  Perl-based software for life science research. http://bio.perl.org/

biopython.org: An international association of developers of freely available Python tools for computational molecular biology. biopython.org provides an online resource for modules, scripts, and web links for developers of Python-based software for life science research. http://www.biopython.org/

BioWidget Consortium Home Page, Computation Biology & Informatics Lab, Univ. of Pennsylvania, US.  The bioWidgets toolkit is a collection of Java Beans (used for development of graphics applications and/or applets in the genomics domain).  http://www.cbil.upenn.edu/bioWidgets/

bioxml.org: A resource to gather XML documentation, DTDs and tools for biology in one central  location.  It overlaps in interest and in tools with the BioPerl project, which also hosts a page about XML. Our goal is to provide the biology community with a set of standard xml tags to facilitate data exchange and a set of parsers for these tags in a variety of popular languages.  In particular, we would like to build parsers that work with the bioperl, biopython and biojava projects. http://www.bioxml.org

Data Model Standards Group:  There is currently an effort underway to standardize domain- specific analytical data models to help integrate public data with proprietary data across all life science domains in an enterprise. The past, present, and future of this group will be discussed [at Intelligent Systems in Molecular Biology August 2000].  [Approaches to Integrating Biological Data, K. Griffiths, R. Resnick, NetGenics, Inc. Intelligent Systems in Molecular Biology ISMB August 19-23, 2000 La Jolla CA, US] http://ismb2000.sdsc.edu/tutorials/griffiths.html

EBI: European Bioinformatics Institute, Hinxton, Cambridge, UK. An EMBL outstation.  http://www.ebi.ac.uk/

Ensembl: A joint project between EMBL- EBI and the Sanger Centre (UK) to develop a software system which produces and maintains automatic annotation on eukaryotic genomes. Human data are available now; they hope to add mouse data soon.  http://www.ensembl.org/index.html

flat files: Pure text documents that are totally unstructured. This type of file generally does not provide very specific search answers, but it is the most popular type of file on the Web and is now a bit easier to search, thanks to the use of hyperlinks. [CHI Bioinformatics] Narrower term: indexed flat files. Related term: relational databases

Genome Annotation Data Warehouse: A computational annotation pipeline is being applied to the genome sequences of human, mouse, and over 23 other organisms. This analysis integrates experimental data and predictions around a genome sequence framework.  The data is periodically obtained from the GenBank/ EMBL/ DDBJ collaboration and processed through a large- scale computational framework consisting of several analysis modules. [Annotated Genomes, Oak Ridge National Lab, TN, US]  http://genome.ornl.gov/GCat/

indexed flat files IFFs: Partially structured databases, which may include a thesaurus (adding the ability to search synonyms) or other basic search tools. ... IFFs, meanwhile, allow users to interactively navigate among entries in several different databases by means of hypertext links. IFFs do not, however, allow true database integration, and gathering information from these types of files is often haphazard: Because the data are not really structured, researchers may end up with many incorrect matches to their queries. The principal advantage of this technology is that it is cheap and easy to understand. [CHI Bioinformatics]
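
A toy version of an indexed flat file: free-text entries, a word index built over them, and a thesaurus that expands a query to its synonyms. The entries and the synonym list are invented, and real IFF systems are considerably richer than this sketch.

```python
from collections import defaultdict

# Unstructured text entries, keyed by an entry id (all invented).
entries = {
    "e1": "tumour suppressor gene",
    "e2": "neoplasm of the liver",
    "e3": "liver enzyme assay",
}

# Thesaurus: query term -> words treated as equivalent.
thesaurus = {"tumor": {"tumor", "tumour", "neoplasm"}}


def build_index(entries):
    """Map every word to the set of entries that contain it."""
    index = defaultdict(set)
    for entry_id, text in entries.items():
        for word in text.lower().split():
            index[word].add(entry_id)
    return index


def search(term, index, thesaurus):
    """Return the sorted ids of entries containing the term or a synonym."""
    term = term.lower()
    hits = set()
    for word in thesaurus.get(term, {term}):
        hits |= index.get(word, set())
    return sorted(hits)


index = build_index(entries)
print(search("tumor", index, thesaurus))   # synonyms widen the search
print(search("liver", index, thesaurus))   # matches on the word alone
```

The second query returns both a disease entry and an assay entry. Because the text is not structured into fields, the index cannot tell which sense of "liver" the searcher wants, which is the source of the incorrect matches mentioned above.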

Interoperable Informatics Infrastructure Consortium I3C: Develops common protocols and interoperable technologies (specifications and guidelines) for data exchange and knowledge management for the life science community. The mission of I3C is to facilitate and enable data exchange, data management, and knowledge management across the entire life science community by promoting common protocols that ensure interoperability in an open, consistent and robust manner.  http://www.i3c.org/

LSR [Life Sciences Research] group:  Focused on the use of CORBA for objects at all levels of software systems for life sciences research. CORBA is independent of implementation language and platform, so specifications adopted by the LSR group can be implemented in the most appropriate language(s) on a variety of hardware and operating systems. Part of OMG. http://www.omg.org/homepages/lsr/FAQ.html#LSR vs BW

NCBI  National Center for Biotechnology Information: Established in 1988 as a national  resource for molecular biology information, NCBI creates public databases, conducts research in  computational biology, develops software tools for analyzing genome data, and disseminates biomedical information - all for the better understanding of molecular processes affecting human health and disease. Part of  NIH. http://www.ncbi.nlm.nih.gov

Object- Protocol Model (OPM): Developed initially by members of the Data Management Research and Development Group at Lawrence Berkeley National Laboratory ... aim to support rapid development of complete database systems, construction of powerful system- independent query interfaces on top of relational and flat- file data resources, integration of heterogeneous data resources and applications into a common object- oriented framework, deployment of configurable Web- based query interfaces for single or multiple databases.  [CHI Bioinformatics]

Open Bioinformatics Foundation OPEN-BIO: The purpose of the foundation is to act as an umbrella organization for the various bio*.org projects that grew out of the original BioPerl project. The goal of the foundation is to provide financial, administrative and technical assistance for our various open source life science projects. http://open-bio.org/


Cambridge Healthtech Institute
1037 Chestnut Street
Newton Upper Falls, MA 02464
Phone: 617-630-1300
Fax: 617-630-1325
Email: chi@healthtech.com