Bioinformatics glossary

Knowledge from many scientific
disciplines and their subfields has to be integrated to achieve the goals
of bioinformatics. The heterogeneity [of contributing fields] inhibits integration, often because their terminologies differ.
We may wish to overcome the problems
of heterogeneity by having standards. ... But standards require stability. Yesterday’s
technological innovations have become today’s infrastructure. ... Effective recall on the world-wide-web is limited
by a flood of irrelevant, obsolete, and even wrong information. ["Bioinformatics:
Converting Data to Knowledge" Gio Wiederhold introductory remarks National
Research Council workshop Feb. 16, 2000 Washington DC, US] http://www-db.stanford.edu/pub/gio/2000/NRC2001.htm

Related glossaries include:
Applications: Drug Discovery & Development, Functional genomics, Sequencing.
Informatics: Algorithms & data management, Chemoinformatics, Computers & computing, Databases & software directory, Molecular Modeling.
Biology: Expression, Sequences, DNA & beyond.

Organizations appear in the In-depth glossary, after the Bibliography.

annotation: The annotation process identifies sequence features on the contigs - such as variation,
sequence tagged sites, FISH mapped clone regions, known and predicted genes, and gene models. This stage
provides contig, mRNA, and protein records with added feature annotation. [NCBI Contig Assembly and Annotation Process,
2001] http://www.ncbi.nlm.nih.gov/genome/guide/build.html#contig

The elucidation and description of biologically relevant features in the sequence
is essential in order for genome data to be useful. The quality
with which annotation is done will have direct impact on the value of the
sequence. At a minimum, the data must be annotated to indicate the existence
of gene coding regions and control regions. Further annotation activities
that add value to a genome include finding simple and complex repeats,
characterizing the organization of promoters and gene families, the distribution
of G + C content, and tying together evidence for functional motifs and homologs.
[Lawrence Berkeley Lab, US "Advanced Computational Structural Genomics"] http://cbcg.lbl.gov/ssi-csb/Meso.html

Explanatory notes, comments, analysis and commentaries added to a database.
May refer to sequence data or protein structures and includes predictions, characterizations,
summaries, and other detailed information, including gene function. Annotation can be manual (as in SWISS-PROT) or automated (as in TrEMBL).
Since annotation is highly skilled and labor intensive, efforts are being
made to automate the process, at least for preliminary data. Related terms: curated
databases, Genetic Variations glossary; In-depth Genetic Annotation Initiative. Narrower terms:
annotation - proteins, comparative genome annotation, genome annotation.

annotation - proteins: Proteomics glossary

bioinformatics: In order for genomics to provide novel and validated targets and provide the basis for personalized and predictive medicine, biological pathways will need to be
mapped and understood. Significant challenges are to be met in understanding the pathways that exist in gene regulation and the expression and utility of
proteins. More robust computational methods are required to analyze gene expression data for higher accuracy and predictive value. SNP data is being
investigated to assess genetic variation in population studies and in developing personalized medicine. Protein
structure modeling through ab initio and homology methods will be important to facilitate the functional annotation of genes and to aid rational drug
design.
Integrative Bioinformatics: High-Throughput Interpretation of Pathways and Biology, Jan. 16-18, 2002, Zurich, Switzerland

Genomics and proteomics are continuing to ramp up the speed with which new data is generated on
expression, structure, interaction and
function. Each of these types of data presents challenges in terms of interpretation, but the biggest challenge lies in finding better ways to integrate different types of data. This is particularly true as more researchers take a
systems biology approach, looking for comprehensive scope. Another huge challenge is to find efficient ways to cut through the noise of variability in comprehensive data to focus in on what is most relevant to a specific disease,
pathway or therapeutic target. This conference will focus on the tools needed to systematically link different types of datasets, and
annotate the experimental data with clinical information.
Bioinformatics: Beyond genome, June 4-5, 2002, San Diego, CA

Research, development or application of computational tools and approaches
for expanding the use of biological, medical, behavioral or health data,
including those to acquire, store, organize, archive, analyze, or visualize such
data. ["Bioinformatics at the NIH", 2001] http://grants.nih.gov/grants/bistic/bistic.cfm

The discipline of storing, retrieving,
analyzing, and integrating biological data. The field currently
encompasses protein structure analysis, gene and protein functional information,
data from patients, pre-clinical and clinical trial information, and studies of
metabolic pathways in numerous species. Bioinformatics will be one of the keys to success for companies applying genomic tools to
drug discovery and development. Demand for greater flexibility, better integration, and
higher-value analytical tools is increasing. As a result, a growing number of companies are competing in this field, with a wider range of offerings and business models. During this, the "functional" and
"high-throughput" phase of genomics, having top-level software products is simply not enough. The most promising contenders offer not just excellent applications but also access to databases
and/or consulting services.
[CHI Bioinformatics]

Study of information content and information flow in biological systems
and processes … the bridge between observations (data) in diverse
biologically-related
disciplines and the derivations of understanding (information) about how
the systems or processes function … A more pragmatic definition in the
case of diseases is the understanding of dysfunction (diagnostics) and
the subsequent applications of the knowledge for therapeutics and prognosis. [Hwa Lim,
D'Trends, Inc. to Mary Chitty, personal communication,
Jan 2000]

We have coined the term Bioinformatics for the study of informatic processes
in biotic systems. Our Bioinformatic approach typically involves spatial, multi-leveled
models with many interacting entities whose behavior is determined
by local information. [Theoretical Biology Group, Univ. of Utrecht, Netherlands,
Paulien Hogeweg Director] http://www-binf.bio.uu.nl/

Original definition was “the study of informatic processes in biotic
systems” [Paulien Hogeweg MIRROR beyond MIRROR, puddles of LIFE, in Artificial
Life, ed. C.G. Langton, Addison Wesley, 297-316, 1988] Related term: data mining, Algorithms &
data management glossary

biomedical computing: Computers & computing glossary

biosemiotics: http://www.gypsymoth.ento.vt.edu/~sharov/biosem/biosem.html#topics

CORBA: Computers & computing glossary

comparative genome annotation: The major immediate interests of
the genome projects are in the identification of protein coding regions.
However, a complete description of gene structure necessitates identification of
the associated sites which signal the different processes in the gene to protein
pathway. Such sites include promoters, transcription start and end
points, poly-adenylation sites, splice sites, and translation
start and stop sites. In addition, regulatory regions form an important
functional component of gene structure. Indeed, gene regulation may utilise
alternatives in promoters, splice sites and translation start sites. Accurate
identification of coding regions is aided by the identification of such sites,
and vice versa. Identification of
regulatory sites is more accurate when they are viewed in the context of other
surrounding elements. ["Briefings in Bioinformatics" special
issue, proceedings from the symposium on "Genome Based Gene Structure
Determination" conducted at the EMBL European Bioinformatics Institute
(EBI) during June 1-2, 2000] http://industry.ebi.ac.uk/~thanaraj/BIB_Editorial.htm

computational biology: The development and application of data-analytical
and theoretical methods, mathematical modelling and computational
simulation techniques to the study of biological, behavioral, and social
systems. ["Bioinformatics at the NIH", 2001] http://grants.nih.gov/grants/bistic/bistic.cfm

A field of biology concerned with
the development of techniques for the collection and manipulation of
biological data, and the use of such data to make biological discoveries
or predictions. This field encompasses all computational methods and theories
applicable to molecular biology and areas of computer-based techniques
for solving biological problems including manipulation of models and datasets.
[MeSH]

controlled vocabulary: Computers & computing glossary

curated databases: Often less complete than primary databases, but
they have less redundancy and the added value of scientific annotation;
therefore, a biologically significant sequence should be easier to find in such
a database and of greater value. Naturally, the degree of redundancy and
annotation in such a database depends on the experience, skills, aims, and
devotion of its curators. ... The only proper way to curate databases is the way groups like those that
developed OMIM, SWISS-PROT and most commercial databases have done it, that
is, through making scientific judgments as data are cleaned up and merged. [CHI
Bioinformatics]

Under the supervision of a curator. Examples
of curated databases are LocusLink, OMIM (Online Mendelian Inheritance
in Man), RefSeq, SGD (Saccharomyces cerevisiae Genome Database) and SWISS-PROT.

data mining: Algorithms & data management glossary

databases: Collections of data in machine-readable form, which
can be manipulated by software to appear in varying arrangements and subsets.
[CHI Bioinformatics] Genetic information is stored in different ways in
different databases, which makes it hard to compare their holdings. So
while computational biologists are trying to improve the quality of the
databases, they are also working to build bridges between them. So
far, they have had only limited success … each database has its own Web
site with unique navigation tools and data storage formats that make such
searching difficult … programs can’t easily recognize data that are not
stored in a uniform way. [Elizabeth Pennisi “Seeking Common language in a Tower
of Babel” Science: 449 Oct. 15 1999]

Databases & software directory describes and provides links
to around 200 databases and about 30 software tools. Narrower terms include annotated
databases, curated databases, federated databases, integrated databases,
interoperability, non-redundant databases, proprietary databases, redundant
databases, relational databases. In-depth flat files, indexed flat files.

federated databases: An integrated repository of data from multiple, possibly heterogeneous, data sources presented with consistent and
coherent semantics. They do not usually contain any summary data, and all of the data resides only at the data source (i.e. no local storage).
[Lawrence Berkeley Lab "Advanced Computational
Structural Genomics" Glossary] http://cbcg.lbl.gov/ssi-csb/Meso.html

federated information systems. Their main characteristic is that they
are constructed as an integrating layer over existing legacy applications and
databases. They can be broadly classified in three dimensions: the degree of
autonomy they allow in integrated components, the degree of heterogeneity
between components they can cope with, and whether or not they support
distribution. Whereas the communication and interoperation problem has come into
a stage of applicable solutions over the past decade, semantic data integration
has not become similarly clear. [Susanne Busse et al. "Federated
Information Systems: Concepts, Terminology and Architecture"
Computergestützte Informations Systeme CIS, Berlin, Germany 1999] http://citeseer.nj.nec.com/busse99federated.html

functional genomic data: Functional genomics glossary

Gene Ontology™ (GO): Functional genomics glossary. Broader term: ontology.

genome annotation: It is now apparent that the bottleneck in genomics
is no longer in sequencing the genomes, but lies in their annotation.
Large-scale annotation efforts require handling massive amounts of genome data
through automated pipelines, with a need to combine diverse sources of data and
methods. In addition, it requires visualisation tools to manually examine the
automatic annotation, since integration of human expertise to assess the
validity and authenticity of all computational results goes a long way to
improve the quality of gene annotation. The "Annotation Jamboree", a
collaboration between Celera, the Berkeley Drosophila Genome Project, and
a team of experts on the annotation of the Adh region of Drosophila,
is an exemplary attempt to transform the process of manual annotation
into a high-throughput operation. [Paradigm Shifts in the Approaches for Gene
Annotation, a special issue of "Briefings in Bioinformatics" which
reports on the proceedings from the recently concluded symposium on "Genome
Based Gene Structure Determination" conducted at the EMBL European
Bioinformatics Institute (EBI) during June 1-2, 2000.] http://industry.ebi.ac.uk/~thanaraj/BIB_Editorial.htm.
Related terms ant analysis"

genomic data: Genomics glossary

integrated databases: Integration [of databases] typically is
accomplished by creating small, object-oriented software elements, or “wrappers”
that let a single overlaying, often browser like, desktop application interact
with all the pieces. The original separate systems are intact and
functional, and new ones can be added, while the underlying complexity
is transparent to users. There are still many challenges … but computing
environments are becoming more unified, flexible and expandable. [A. Thayer
“Bioinformatics for the Masses” Chemical & Engineering News 78(6):
19-32 Feb. 7, 2000] See also Gene Ontology; In-depth Bio-ontology Standards
Group, Data Model Standards Group, In-depth Bioinformatics glossary

Information in OMIM [Online Mendelian Inheritance in Man] and the published working draft of the International
Human Genome Sequencing Consortium (Nature 15 Feb. 2001) has been facilitated
by ties to NCBI's RefSeq and LocusLink databases. Are there other good
examples of integrated databases?

integration (of databases): Allows researchers to increase the value
they get from the data, because it increases the base of information they can
access and allows for more robust searching. [CHI Bioinformatics] Related
terms include middleware, Object Oriented modeling OOM, In-depth object
protocol model OPM; Maps genomic & genetic memory mapped data structures

interoperability: Computers & computing glossary

memory-mapped data structures: In this approach [to data-level
integration without semantic cleaning] subsets of data from various sources
are collected, normalized, and integrated in memory for quick access. While
this approach performs actual data integration and addresses the problem
of poor performance in the federated approach, it requires additional
calls to traditional relational databases to integrate descriptive data.
While data cleaning is being performed on some of the data sources, it
is not being done across all sources or in the same place. This makes it
difficult to quickly add new data sources. [Approaches to Integrating Biological Data, K. Griffiths, R. Resnick, NetGenics,
Inc. Intelligent Systems in Molecular Biology, August 19-23, 2000 La Jolla
CA, US] http://ismb2000.sdsc.edu/tutorials/griffiths.html

metadata: Computers & computing glossary

middleware: Computers & computing glossary

modularity: Ensures that, for the particular task at hand, the data
will be collected and stored in an appropriate manner - which differs greatly
from one level of activity (simply gathering the raw data) to another (storing
analyzed data) and from one type of high- throughput system to another. ... The
best system is one that employs integration at those levels where it is an
advantage but maintains enough modularity to ensure that (1) there are no major
compromises regarding how any one type of data is handled and, (2) all the key
elements in a researcher’s information system can be adjusted or updated
independently. [CHI Bioinformatics] Related terms: integration,
interoperability.

molecular bioinformatics: Conceptualizing biology in terms of
molecules (in the sense of physical-chemistry) and then applying
"informatics" techniques (derived from disciplines such as applied
math, CS [computer science] and statistics) to understand and organize the
information associated with these molecules on a large-scale. [Mark Gerstein
"What is Bioinformatics?" MB&B 474b3, 2001] http://bioinfo.mbb.yale.edu/what-is-it.html

non-redundant databases: Researchers at the National Center for
Biotechnology Information (NCBI) coined the term "nr" database
(nonredundant database) to refer to a database in which the obviously
redundant entries have been merged. These entries are typically those that are
100%, character-by-character identical, and algorithms exist that can remove
such redundancy. Although such a database has less redundancy than a primary
database, a substantial amount of redundancy remains, and it can be removed only
by a curator using scientific judgment. [CHI Bioinformatics]

Many databases try to be “non-redundant”.
Unfortunately, biological data is too complex to fit a simple definition
of redundancy … Each “non-redundant” database has its own definition of
redundancy. [George Church Lab, Harvard Medical School, US] http://arep.med.harvard.edu/seqanal/db.html
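
The merging of entries that are 100% character-by-character identical, as described for "nr" databases, can be sketched in a few lines of Python. This is a minimal illustration only: the identifiers and sequences are made up, the function name `merge_identical` is hypothetical, and it is not a description of NCBI's actual procedure.

```python
from collections import OrderedDict


def merge_identical(records):
    """Merge records whose sequences are character-by-character identical.

    `records` is an iterable of (identifier, sequence) pairs. Returns a list
    of (identifiers, sequence) pairs, one per distinct sequence, in order of
    first appearance.
    """
    merged = OrderedDict()
    for identifier, sequence in records:
        # Exact string equality is the only notion of redundancy used here.
        merged.setdefault(sequence, []).append(identifier)
    return [(ids, seq) for seq, ids in merged.items()]


# Toy input (hypothetical identifiers and sequences).
records = [
    ("entryA", "MKTAYIAKQR"),
    ("entryB", "MKTAYIAKQR"),  # exact duplicate of entryA, so it is merged
    ("entryC", "MKTAYIAKQK"),  # differs by one residue, so it is kept
]
nr = merge_identical(records)
for ids, seq in nr:
    print(",".join(ids), seq)
```

Near-duplicates such as `entryC` are deliberately left alone: as the definitions above note, removing that kind of redundancy depends on each database's own definition and on curator judgment.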
Examples of non-redundant databases include UniGene and SWISS-PROT,
while DDBJ/EMBL/GenBank are redundant databases.

OMG Object Management Group: Computers & computing glossary

Object-oriented modeling OOM: Computers & computing glossary

ontology: Computers & computing glossary

proprietary databases: Fee-based, copyrighted databases
(in contrast to public databases such as those at DDBJ/EMBL/GenBank).
Examples include Incyte's LifeSeq and Gene Logic's GeneExpress
databases. Some databases charge subscription fees to commercial
organizations, with other arrangements available to non-profits.

redundant databases: When sequence databanks were first created,
primary [redundant] databases had the advantage of being more comprehensive than
curated databases and more likely to contain recently discovered sequences.
However, redundancy is no longer much of an advantage. In a highly redundant
database, biologically significant results are more likely to be hidden among
large numbers of irrelevant reported matches. [CHI Bioinformatics] Related term: non-redundant
databases

relational databases: Most or all of the data are structured. These
files are the hardest to set up and maintain, and require specific knowledge by
a searcher, but they are the easiest to use when doing analysis or integration.
Data is categorized by specific fields, so by knowing the fields one should
be able to capture all the relevant data quite easily. The searchability of a
relational database is totally dependent on how well the database has been
structured. [CHI Bioinformatics]

schema (plural schemata): A description of the data represented
within a database. The format of the description varies but includes a
table layout for a relational database or an entity-relationship diagram.
[Lawrence Berkeley Lab "Advanced Computational Structural Genomics" Glossary]
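
As a small illustration of a schema expressed as a table layout, the sketch below uses Python's standard `sqlite3` module. The table and column names (`sequence_entry`, `accession`, `organism`, `description`) and the rows are invented for the example and are not taken from any database named in this glossary.

```python
import sqlite3

# The schema: a description of the data, here as a relational table layout.
SCHEMA = """
CREATE TABLE sequence_entry (
    accession   TEXT PRIMARY KEY,
    organism    TEXT NOT NULL,
    description TEXT
)
"""

conn = sqlite3.connect(":memory:")  # throwaway in-memory database
conn.execute(SCHEMA)
conn.executemany(
    "INSERT INTO sequence_entry VALUES (?, ?, ?)",
    [
        ("X0001", "Homo sapiens", "hypothetical kinase"),
        ("X0002", "Mus musculus", "hypothetical kinase"),
        ("X0003", "Homo sapiens", "hypothetical transporter"),
    ],
)

# Knowing the fields lets a searcher ask for exactly the relevant rows.
human = [
    row[0]
    for row in conn.execute(
        "SELECT accession FROM sequence_entry WHERE organism = ? ORDER BY accession",
        ("Homo sapiens",),
    )
]

# The column layout can be read back with PRAGMA table_info (name is field 1).
columns = [row[1] for row in conn.execute("PRAGMA table_info(sequence_entry)")]
print(human, columns)
```

The query only works because the searcher knows that an `organism` field exists, which is the trade-off the relational databases entry describes: more setup and required knowledge, but precise retrieval.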
Narrower term: global schema. Algorithms
& data management glossary http://cbcg.lbl.gov/ssi-csb/Meso.html#anchor597905

standards: Related terms: CORBA; In-depth Bio-ontology Standards Group,
Data Model Standards Group, object protocol model OPM. EBI is also
working on standards.

structural bioinformatics: Structural genomics glossary

taxonomies: Computers & computing glossary

throughput: Bioinformatics is currently undergoing dramatic changes, as
high-throughput laboratory methods lead to changes in key approaches, including
sequence analysis, gene expression analysis, protein expression analysis, and
protein structure prediction and modeling. [CHI Bioinformatics
Report] http://www.chireports.com/pressrelease/bioinformatics.asp

Gene expression monitoring has provided valuable insight into biological mechanisms. In the future, efforts to
map all proteins and identify interactions will likewise provide useful clues to biologists. Computational methods are needed to accomplish this in high throughput, without requiring the identification of individual
genes but rather gene pathways and regulatory circuits. The complexity of a biological system can be reproduced by carefully observing the patterns of gene and protein expression and correlating these patterns with factors that cause
perturbation. [CHI Integrated Bioinformatics Jan. 2001, Zurich Switzerland] http://www.healthtech.com/2001/bne/
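
One simple way to correlate an expression pattern with a perturbing factor is a correlation coefficient computed across samples. The sketch below computes Pearson's r in plain Python. The gene names, dose levels and expression values are invented toy numbers, and the choice of Pearson correlation is an assumption made for illustration, not a method named by the source.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length numeric sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / math.sqrt(var_x * var_y)


# Toy data: perturbation level per sample and expression of two made-up genes.
dose = [0.0, 1.0, 2.0, 3.0, 4.0]
expression = {
    "geneA": [1.0, 2.0, 3.0, 4.0, 5.0],  # rises with dose
    "geneB": [5.0, 4.0, 3.0, 2.0, 1.0],  # falls with dose
}
scores = {gene: pearson(dose, values) for gene, values in expression.items()}
for gene, r in scores.items():
    print(gene, round(r, 3))
```

A strongly positive or negative score flags a gene whose expression tracks the perturbation; real analyses would also need replicates and a correction for testing many genes at once.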
Related terms: Functional genomics
glossary; systems biology Structural
genomics glossary structural proteomics.

Bibliography

[CHI Bioinformatics]: Getting Results in the Era of High-Throughput Genomics, April
2001. http://www.chireports.com/content/reports/bioinformatics.asp

In-depth Bioinformatics glossary

BISTI: Biomedical Information Science and Technology Initiative.
NIH Working Group on Biomedical Computing made recommendations in their
June 3, 1999 report on creating a national bioinformatics infrastructure.
BISTI recommendations, April 2000 http://grants.nih.gov/grants/bistic/bisti_recommendations.cfm

BISTI Consortium: Established in May 2000 to serve as the focus
of biomedical computing issues at the NIH and to facilitate implementation
of the BISTI recommendations. The Consortium is composed of senior-level
representatives from the NIH centers and institutes and representatives
of other Federal agencies concerned with bioinformatics and computational
applications. The mission of the BISTI Consortium is to make optimal use
of computer science and technology to address problems in biology and medicine
by fostering new basic understandings, collaborations, and transdisciplinary
initiatives between the computational and biomedical sciences. http://grants.nih.gov/grants/bistic/bistic2.cfm

BSML Bioinformatic Sequence Markup Language: An XML application
from Visual Genomics. It attempts to address the problems of comparing
genetic data from multiple sources and platforms for effective management,
communication and interactive visualization of bioinformatic data. [Online
Journal of Bioinformatics 1(1): 1-13, 2000] http://www.cpb.uokhsc.edu/ojvr/xmlpaper.html

biocorba.org: Provides an object-oriented, language-neutral,
platform-independent method for describing and solving bioinformatic problems.
BioCORBA's mission is to leverage the code of the other Bio projects in
a simple and easy-to-use fashion. For example, the language-neutral environment
allows users to write programs using BioPython and access BioPerl modules
through the CORBA server. http://www.biocorba.org/

biojava.org: An open-source project dedicated to providing Java
tools for processing biological data. This will include objects for manipulating
sequences, file parsers, CORBA interoperability, access to ACeDB, dynamic
programming, and simple statistical routines. The BioJava library
is useful for automating those daily and mundane bioinformatics tasks.
http://www.biojava.org/

Bio-ontology Standards Group: There is currently an effort underway
to standardize domain-specific ontologies and vocabularies to support interoperability
of data and software components. [Approaches to Integrating Biological
Data, K. Griffiths, R. Resnick, NetGenics, Inc. Intelligent Systems in
Molecular Biology ISMB August 19-23, 2000 La Jolla CA, US] http://ismb2000.sdsc.edu/tutorials/griffiths.html

bioperl.org: An international association of developers of open
source Perl tools for bioinformatics, genomics and life science research.
We work closely with our friends and colleagues at biojava.org, biopython.org
and bioxml.org. The Bioperl server provides an online resource for modules,
scripts, and web links for developers of Perl-based software for
life science research. http://bio.perl.org/

biopython.org: An international association of developers of
freely available Python tools for computational molecular biology. biopython.org
provides an online resource for modules, scripts, and web links for developers
of Python-based software for life science research.
http://www.biopython.org/

BioWidget Consortium Home Page, Computational Biology & Informatics
Lab, Univ. of Pennsylvania, US. The bioWidgets toolkit is a collection
of Java Beans (used for development of graphics applications and/or applets
in the genomics domain). http://www.cbil.upenn.edu/bioWidgets/

bioxml.org: A resource to gather XML documentation, DTDs and
tools for biology in one central location. It overlaps in interest
and in tools with the BioPerl project, which also hosts a page about XML.
Our goal is to provide the biology community with a set of standard XML
tags to facilitate data exchange and a set of parsers for these tags in
a variety of popular languages. In particular, we would like to build
parsers that work with the bioperl, biopython and biojava
projects. http://www.bioxml.org

Data Model Standards Group: There is currently an effort
underway to standardize domain- specific analytical data models to help
integrate public data with proprietary data across all life science domains
in an enterprise. The past, present, and future of this group will be discussed
[at Intelligent Systems in Molecular Biology August 2000]. [Approaches
to Integrating Biological Data, K. Griffiths, R. Resnick, NetGenics, Inc.
Intelligent Systems in Molecular Biology ISMB August 19-23, 2000 La Jolla
CA, US] http://ismb2000.sdsc.edu/tutorials/griffiths.html

EBI: European Bioinformatics Institute, Hinxton, Cambridge, UK.
An EMBL outstation. http://www.ebi.ac.uk/

Ensembl: A joint project between EMBL-EBI and the Sanger Centre
(UK) to develop a software system which produces and maintains automatic
annotation on eukaryotic genomes. Human data are available now; they hope to add mouse data
soon. http://www.ensembl.org/index.html

flat files: Pure text documents that are totally unstructured. This
type of file generally does not provide very specific search answers, but it is
the most popular type of file on the Web and is now a bit easier to search,
thanks to the use of hyperlinks. [CHI Bioinformatics] Narrower term: indexed
flat files. Related term: relational databases

Genome Annotation Data Warehouse: A computational annotation
pipeline is being applied to the genome sequences of human, mouse, and
over 23 other organisms. This analysis integrates experimental data and
predictions around a genome sequence framework. The data is
periodically obtained from the GenBank/ EMBL/ DDBJ collaboration and
processed through a large- scale computational framework consisting of several
analysis modules. [Annotated Genomes, Oak
Ridge National Lab, TN, US] http://genome.ornl.gov/GCat/

indexed flat files IFFs: Partially structured databases, which may
include a thesaurus (adding the ability to search synonyms) or other basic
search tools. ... IFFs, meanwhile, allow users to interactively navigate among
entries in several different databases by means of hypertext links. IFFs do not,
however, allow true database integration, and gathering information from these
types of files is often haphazard: Because the data are not really structured,
researchers may end up with many incorrect matches to their queries. The
principal advantage of this technology is that it is cheap and easy to
understand. [CHI Bioinformatics]

Interoperable Informatics Infrastructure Consortium I3C: Develops
common protocols and interoperable technologies (specifications and guidelines)
for data exchange and knowledge management for the life science community. The
mission of I3C is to facilitate and enable data exchange, data management, and
knowledge management across the entire life science community by promoting
common protocols that ensure interoperability in an open, consistent and robust
manner. http://www.i3c.org/

LSR [Life Sciences Research] group: Focused on the
use of CORBA for objects at all levels of software systems for life sciences
research. CORBA is independent of implementation language and platform,
so specifications adopted by the LSR group can be implemented in the most
appropriate language(s) on a variety of hardware and operating systems.
Part of OMG. http://www.omg.org/homepages/lsr/FAQ.html#LSR vs BW

NCBI National Center for Biotechnology Information: Established
in 1988 as a national resource for molecular biology information,
NCBI creates public databases, conducts research in computational
biology, develops software tools for analyzing genome data, and disseminates
biomedical information - all for the better understanding of molecular
processes affecting human health and disease. Part of NIH. http://www.ncbi.nlm.nih.gov

Object-Protocol Model (OPM): Developed initially by members of the
Data Management Research and Development Group at Lawrence Berkeley National
Laboratory ... aim to support rapid development of complete database systems,
construction of powerful system-independent query interfaces on top of
relational and flat-file data resources, integration of heterogeneous data
resources and applications into a common object-oriented framework, deployment
of configurable Web-based query interfaces for single or multiple
databases. [CHI Bioinformatics]

Open Bioinformatics Foundation OPEN-BIO: The purpose of the foundation is to act as an umbrella
organization for the various bio*.org projects that grew out of the original BioPerl
project. The goal of the foundation is to provide financial, administrative and
technical assistance for our various open source life science projects. http://open-bio.org/