Algorithms & data management glossary

With changes in sequencing technology and methods, the rate
of acquisition of human and other genome data over the next few years will
be ~100 times higher than originally anticipated. Assembling and interpreting
these data will require new and emerging levels of coordination and collaboration
in the genome research community to develop the necessary computing algorithms,
data management and visualization system. [Lawrence Berkeley
Lab, US "Advanced Computational Structural Genomics"] http://cbcg.lbl.gov/ssi-csb/Meso.html

Related glossaries include Applications: Drug Discovery & Development, Sequencing;
Informatics: Bioinformatics, Chemoinformatics, Computers & computing, Molecular
Modeling, Research Biology: Protein Structures, Sequences, DNA & beyond.
Additional definitions appear in the In-depth glossary, after the Bibliography.

3D-QSAR Three-Dimensional Quantitative Structure-Activity Relationships:
Involves the analysis of the quantitative relationship between the
biological activity of a set of compounds and their three-dimensional properties
using statistical correlation methods. [IUPAC Computational] Broader terms
QSAR; Drug discovery & development SAR Structure Activity Relationship.
Narrower terms In-depth CoMFA Comparative Molecular Field Analysis.
Related term Drug discovery & development drug design.

algorithm: A computable set of steps to achieve a desired result. Note: The word comes from the Persian author Abu Ja'far
Mohammed ibn Mûsâ al-Khowârizmî who wrote a book with
arithmetic rules dating from about 825 A.D. [NIST] Rules or a process, particularly in computer
science. In medicine, a step-by-step process for reaching a diagnosis or
ruling out specific diseases. May be expressed as a flow chart in
either sense. Greater efficiencies in algorithms, as well as improvements in computer
hardware, have led to advances in computational biology. Narrower terms sequencing algorithms; In-depth Bayesian
inference algorithms, genetic algorithm, heuristic algorithm. Related terms heuristic, parsing; Sequencing
dynamic programming methods.

annotation: Bioinformatics glossary.

artificial intelligence (AI): A wide-ranging term encompassing
computer applications that have the ability to make decisions; the ability
to explain reasoning is evidence of intelligence. Also covers methods
that have the ability to learn. [J Glassey et al. “Issues in the development
of an industrial bioprocess advisory system” Trends in Biotechnology 18
(4):136-41 April 2000] Or, as some people have noted, laboriously trying to get computers to
do what people do intuitively, without great effort. Conversely, there are things
computers can do (relatively) effortlessly, such as massive numbers of
error-free calculations. The most promising applications seem to
combine computer-aided consideration of many possibilities
with human judgment. Narrower terms In-depth cellular automata, expert systems, fuzzy logic, genetic algorithms, neural nets.
Related term training sets.

Artificial Intelligence Links
American Association of Artificial Intelligence: Topics http://www.aaai.org/AITopics/html/current.html
Virtual Library Artificial Intelligence, David Corne, March 1997 http://www.u.arizona.edu/~avs/ACG/AI.html
University and government research sites, newsgroups, commercial sites
and products, programming languages, journals, bibliographies, “interactive
things” and other information.

biometrics: The information age is quickly revolutionizing the way
transactions are completed. Everyday actions are increasingly being handled
electronically, instead of with pencil and paper or face to face. This growth in
electronic transactions has resulted in a greater demand for fast and accurate
user identification and authentication. Biometric technology is a way to achieve
fast, user-friendly authentication with a high level of accuracy. [Biometrics
Consortium] http://www.biometrics.org/REPORTS/CTSTG96/

cluster analysis: The clustering, or grouping, of large
data sets (e.g., chemical and/or pharmacological data sets) on the basis
of similarity criteria for appropriately scaled variables that represent
the data of interest. Similarity criteria (distance based, associative,
correlative, probabilistic) among the several clusters facilitate the recognition
of patterns and reveal otherwise hidden structures (Rouvray, 1990; Willett,
1987, 1991). [IUPAC Computational] This data-analysis approach uses standard statistical algorithms to arrange
genes according to similarity in patterns of gene expression. The output is
displayed graphically, conveying the clustering and the underlying gene
expression data simultaneously. (Eisen MB, et al. "Cluster analysis and
display of genome-wide expression patterns." Proceedings of the National
Academy of Sciences, U.S.A. 1998;95:14863-14868.) Clusters, and the genes
within them, can be examined for commonalities in function or sequence to help
researchers better understand how and why they behave similarly. [CHI
Microarrays]

S. cerevisiae Genome Cluster Analysis and Display of Genome-wide
Expression Patterns, Stanford Univ., US http://rana.Stanford.EDU/clustering/
An online supplement to Mike Eisen’s 1998 PNAS article [above reference].

Has been used in medicine to create taxonomies of diseases and
diagnoses and in archaeology to establish taxonomies of stone tools and funerary
objects. Related terms hierarchical clustering, pattern
recognition. Narrower terms k-means clustering, self-organizing maps.

collaborative filtering: Tools that leverage user preferences,
patterns, and purchasing behavior to customize organization and navigation
systems. [Peter Morville "Software for Information Architects" Argus
Center for Information Architecture, 2000] http://argus-acia.com/strange_connections/current_article.html Amazon's recommendations, based on what other buyers of a specific title
have purchased, are a familiar example of collaborative filtering.

common factor analysis: See under In-depth principal component analysis (PCA).

comparative molecular field analysis (CoMFA): A 3D-QSAR
method that uses statistical correlation techniques for the analysis of the
quantitative relationship between the biological activity of a set of compounds
with a specified alignment, and their three-dimensional electronic and steric
properties. Other properties such as hydrophobicity and hydrogen bonding can
also be incorporated into the analysis. (See also Three-dimensional
Quantitative Structure-Activity Relationship [3D-QSAR]). [IUPAC
Medicinal Chemistry]

data cleaning: Removal and/or correction of erroneous data introduced
by data entry errors, expired validity of data, or by some other means.
[Lawrence Berkeley Lab "Advanced Computational Structural Genomics" Glossary] http://cbcg.lbl.gov/ssi-csb/Meso.html#anchor597905 The quality of data in sequence databases is highly variable. This is
receiving increasing attention. Ensembl (Bioinformatics
In-depth) differentiates data of varying quality. Related term data
reduction methods.

data integration: Related terms Bioinformatics
interoperability, XML.

data management methods: Include algorithms, artificial
intelligence, data cleaning, data mining, data reduction methods, expert
systems, factorial design, fuzzy logic, knowledge based systems, neural
networks, normalization, parsing, pattern recognition, SPC (Structure-Property Correlations), visualization and
various statistical methods. In-depth CoMFA, decision trees, factorial
design, mosaic plots, multivariate statistics, Partial Least Squares (PLS), Principal
Components Analysis (PCA), recursive partitioning. Clinical
genomics glossary meta-analysis.

data mart: See under data warehouse.

data mining: Nontrivial extraction
of implicit, previously unknown and potentially useful information from
data, or the search for relationships and global patterns that exist in
databases. [Bob Klevecz "The Whole EST Catalog" Scientist 12 (2): 22 Jan
18 1999] Exploration and analysis, by automatic
or semi-automatic means, of large quantities of data in order to discover
meaningful patterns or rules. [Berry, MJA, Data Mining Techniques for
Marketing, Sales and Customer Support John Wiley & Sons, New York
1997 cited in Nature Genetics 21(15): 51-55 ref 11, 1999] May need to incorporate related techniques such as cluster analysis or
visualization. Narrower terms In-depth
affinity based data mining, comparative data mining,
influence-based data mining, predictive data mining, text mining, time delay data mining,
trends analysis data mining. Imaging image
data mining. Related terms data warehouse.

data reduction methods: Includes cluster analysis, currently the best
known data reduction method in the microarray field. [CHI Bioinformatics] Related term data cleaning.

data warehouse: An integrated repository of data from multiple,
possibly heterogeneous data sources, presented with consistent and coherent
semantics. Warehouses usually contain summary information represented on
a centralized storage facility. [Lawrence Berkeley Lab "Advanced Computational
Structural Genomics" Glossary] http://cbcg.lbl.gov/ssi-csb/Meso.html#anchor597905 The term was coined by W. H. Inmon. ... Typically, a data
warehouse is housed on an enterprise mainframe server. ... Data warehousing
emphasizes the capture of data from diverse sources for useful analysis
and access, but does not generally start from the point of view of
the end user or knowledge worker who may need access to specialized, sometimes
local databases. The latter idea is known as the data mart. [whatis.com] Related
terms data mining, global schema.

evolutionary computation methods: Include genetic
algorithms (GAs) or genetic programming (GP), which may make it possible to
discriminate between common infectious agents, monitor complex industrial
bioprocesses, and detect specific chemical biomarkers in bacteria. [Roy Goodacre
"Evolutionary Computation for Interpretation of Metabolomic
Data" Metabolic Profiling Dec. 3-4, 2001 Chapel Hill, NC]

experimental design: The use of mathematical and statistical
methods to select the minimum number of experiments or compounds for optimal
coverage of descriptor or variable space. [IUPAC Computational]

functional genomic data: Functional genomics glossary.

genetic algorithm (GA): Method for library design by evaluating
the fit of a parent library to some desired property (e.g. the level of
activity in a biological assay, or the computationally determined diversity
of the compound set) as measured by a fitness function. The design of more
optimal daughter libraries is then carried out by a heuristic process
with similarities to genetic selection in that it employs replication, mutation,
deletions etc. over a number of generations. [IUPAC Combinatorial
Chemistry] An optimization algorithm based on the mechanisms of Darwinian evolution
which uses random mutation, crossover and selection procedures to breed
better models or solutions from an originally random starting population
or sample. (Rogers and Hopfinger, 1994). [IUPAC Computational] Genetic Algorithms Archive http://www.aic.nrl.navy.mil/galist/ Related terms
Computers & computing evolutionary computation; Drug
discovery & development drug design.

genome mining: In an initial data-mining effort, the draft human genome was searched to find paralogs of known tumor suppressor genes, and for gene arrangements that are typical of oncogenes in cancer cells. The results were disappointing, indicating that although knowledge of the human genome will undoubtedly be of great help, other approaches to identify new oncogenes are needed.
[TG Boyer et al. "Genome mining for human cancer genes: wherefore art thou?"
Trends in Molecular Medicine 7 (5): 187-189, May 2001] http://www.ncbi.nlm.nih.gov:80/entrez/query.fcgi?cmd=Retrieve&db=PubMed&list_uids=11325617&dopt=Abstract

genomic data: Genomics glossary.

global schema: A schema, or a map of the data content of a data
warehouse that integrates the schemata from several source repositories.
It is "global", because it is presented to warehouse users as the schema
that they can query against to find and relate information from any of
the sources, or from the aggregate information in the warehouse. [Lawrence
Berkeley Lab "Advanced Computational Structural Genomics" Glossary] http://cbcg.lbl.gov/ssi-csb/Meso.html#anchor597905 Broader term
schema.

heuristic: Tools such as genetic algorithms or neural
networks employ heuristic methods to derive solutions which may be
based on purely empirical information and which have no explicit rationalization.
[IUPAC Combinatorial Chemistry] Trial and error methods. Narrower term: In-depth heuristic
algorithm.

Hidden Markov Models (HMM): Molecular
modeling glossary. Related term simulated annealing.

hierarchical clustering: An unsupervised clustering approach that has
proven valuable for recognizing patterns in gene expression data.
However, since the output of hierarchical clustering is a tree-like structure,
rather than separate clusters as produced by a self-organizing map (SOM),
it is often arbitrary to determine whether two subtrees should belong to a
single cluster. [Jinfeng Liu "Analysis of yeast microarray
gene expression data: Hierarchical Clustering and Self-Organizing Map" Columbia
Univ. Bioinformatics 2000] http://cubic.bioc.columbia.edu/~liu/Project2/proposal.html

knowledge based systems: An extension of the expert system concept
wherein additional forms of knowledge, such as mathematical models, are
incorporated with the expert rules. [J Glassey et al. “Issues in the development
of an industrial bioprocess advisory system” Trends in Biotechnology 18
(4):136-41 April 2000] Related term data mining.

knowledge management: The ability to navigate through and analyze
large amounts of data, and to ensure a flow of knowledge to the right people at
the right time, is emerging as a major competitive advantage. This is especially
critical as companies seek to exploit emerging technologies, coordinate research
and activities across large organizational and geographic distances, and manage costs and projects effectively. Knowledge management systems are designed to
capture much of the tacit capabilities of an organization, including the skills
and experience of its employees. [CHI Summit Report, Transforming the Pharmaceutical Industry –
The Industrialization of Research and New Market Realities]

Knowledge Management link
Virtual Library: Knowledge Management, May 2000 http://www.brint.com/km/
Definition, articles, white papers, interviews, business and technology
library, periodicals and publications, “out of box thinking”, “movers and shakers”, “think tank”, calendar of events, emerging topics.

lexical parsing: See under parsing.

neural networks: Technique for optimizing a desired property
given a set of items which have been previously characterized with respect
to that property (the 'training set'). Features of members
of the training set which correlate with the desired property are 'remembered'
and used to generate a model for selecting new items with the desired property
or to predict the fit of an unknown member. [IUPAC Combinatorial Chemistry] Communication between statisticians and neural net researchers is often
hindered by the different terminology used in the two fields. There is a
comparison of neural net and statistical jargon in ftp://ftp.sas.com/pub/neural/jargon
[Neural Network FAQ Part 1] ftp://ftp.sas.com/pub/neural/FAQ.html Narrower term In-depth artificial neural networks. Often uses In-depth
fuzzy logic. Related terms artificial intelligence; Molecular
modeling glossary self-organizing maps.

normalization: In creating a database,
normalization is the process of organizing it into tables in such a way that the
results of using the database are always unambiguous and as intended.
Normalization may have the effect of duplicating data within the database and
often results in the creation of additional tables. (While normalization tends
to increase the duplication of data, it does not introduce redundancy, which is
unnecessary duplication.) Normalization is typically a refinement process after
the initial exercise of identifying the data objects that should be in the
database, identifying their relationships, and defining the tables required and
the columns within each table. [whatis.com]

parsing: Using algorithms to analyze data into components. Semantic
parsing involves trying to figure out what the components mean. Lexical
parsing refers to the process of deconstructing the data into components.

pattern recognition (PR): The identification of patterns in large
data sets using appropriate mathematical methodologies. Examples
are principal component analysis (PCA), SIMCA, partial least squares
(PLS) and artificial neural networks (ANN) (Rouvray, 1990;
Van de Waterbeemd, 1995a,b). [IUPAC Computational] Narrower terms In-depth
artificial neural networks, molecular
pattern recognition, principal component analysis (PCA), SIMCA, partial least squares
(PLS).

probability: Probability web http://www.mathcs.carleton.edu/probweb/probweb.html

protein and mRNA data: Proteomics glossary.

Quantitative Structure-Activity Relationships (QSAR): Mathematical relationships linking chemical structure and pharmacological
activity in a quantitative manner for a series of compounds. Methods which
can be used in QSAR include various regression and
pattern recognition
techniques. QSAR is often taken to be equivalent to chemometrics or multivariate
statistical data analysis. It is sometimes used in a more limited
sense as equivalent to Hansch analysis. QSAR is a subset of the more general
term SPC. [IUPAC Computational] The building of structure – biological activity
models by using regression analysis with physicochemical constants,
indicator variables or theoretical calculations. The term has been extended
by some authors to include chemical reactivity, i.e. activity is regarded
as synonymous with reactivity. This extension is, however, discouraged. Related
term correlation analysis. [IUPAC Compendium] Related terms SAR Structure Activity Relationship; In-depth Hansch
analysis; Drug discovery &
development drug design.

regression analysis: The use of statistical methods for
modeling a set of dependent variables, Y, in terms of combinations of
predictors, X. It includes methods such as multiple linear
regression (MLR) and partial least squares (PLS). [IUPAC Computational]

regression to the mean: A common misconception about genetics has to
do with overgeneralization about the likelihood of increased quality by selective breeding.
Two very tall parents will tend to produce offspring who are taller than the
average population -- but less tall than the average of the parents'
heights. Or consider what George Bernard Shaw is supposed to have said to a famous
beauty who suggested they have a child: "With your brains and my looks
..." He retorted, "But what if the child had my looks and your
brains?"

robust: A statistical test
that yields approximately correct results despite the falsity of certain
of the assumptions on which it is based. [OED] Hence, can refer to
a process which is relatively insensitive to human foibles and variables
in the way something (for example, an assay) is carried out.

SAR Structure Activity Relationship: Drug
discovery & development. Narrower terms 3D-QSAR, QSAR.

SPC Structure-Property Correlations: All statistical mathematical
methods used to correlate any molecular property (intrinsic, chemical or
biological) to any other property, using statistical regression or pattern
recognition techniques (Van de Waterbeemd, 1992). QSAR is a
subset of the more general term SPC. [IUPAC Computational] Narrower terms: 3D QSAR, QSAR.

schema: Bioinformatics glossary.
Narrower term In-depth global schema.

self-organizing map: A type of mathematical cluster analysis
that is particularly well suited for recognizing and classifying features
in complex, multidimensional data. The method has been implemented in a
publicly available computer package, GENECLUSTER, that performs the analytical
calculations and provides easy data visualization … Expression patterns
of some 6,000 human genes were assayed, and an online database was created.
GENECLUSTER was used to organize the genes into biologically relevant clusters
that suggest novel hypotheses about hematopoietic differentiation. [P. Tamayo
et al. “Interpreting patterns of gene expression with self-organizing maps:
methods and application to hematopoietic differentiation” PNAS 96(6):2907-12
Mar 16, 1999]

self organization: A process where the organization (constraint, redundancy) of a system spontaneously increases, i.e. without this increase being controlled by the environment or an encompassing or otherwise external system.
[F. Heylighen, "Self Organization" Jan 27, 1997
in: F. Heylighen, C. Joslyn and V. Turchin (editors): Principia Cybernetica Web (Principia
Cybernetica, Brussels)] http://pespmc1.vub.ac.be/SELFORG.html Related term neural networks.

semantic parsing: See under parsing.

sequencing algorithms: See BLAST, FASTA, Needleman-Wunsch,
Smith-Waterman (Sequencing Glossary
In-depth).

stochastic: "Aiming, proceeding by guesswork" (Webster's Collegiate
Dictionary). Term which is often applied to combinatorial processes involving
true random sampling, such as selection of beads from an encoded library,
or certain methods for library design. [IUPAC Combinatorial Chemistry] Truly random, based on probability.

text mining: Using data mining on unstructured data, such as the
biomedical literature. Related term Computers
& computing natural language processing.

training set: Rule based example sets. Related term neural
networks.

visualization: Among the most significant unmet needs
in bioinformatics are for improved visualization and data-mining software. Now
that researchers are regularly dealing with hundreds of thousands to millions of
data points, visualization is critical. But to mine genomic data effectively,
such tools will need to be married to sophisticated analysis packages that
employ advanced statistical techniques. [CHI Bioinformatics] Use of computer-generated graphics to make the
information more accessible and interactive. Related term data mining.
Visualization in Bioinformatics link, Alan Robinson, EBI, UK http://industry.ebi.ac.uk/~alan/VisSupp/

visualisation tools: Anything from visual … starting points
for navigation of data to digestions of data into graphical representations
of the results. There are an increasing number of tools being developed
of both generic use (rule, tree, map and other graphing visualisers) and
for bioinformatics (genome browsers, 3D viewers, sequence searching filters,
etc.). Very few of these tools are capable of exploiting multiple databases.
[A Robinson “About Visualisation” EBI, UK Mar 2000] http://industry.ebi.ac.uk/~alan/VisSupp/AboutVisSupp.html

IUPAC definitions are reprinted with the permission of the International
Union of Pure and Applied Chemistry.

Bibliography

[Flake] Gary Flake, Computational Beauty of Nature: Computer Explorations of
Fractals, Chaos, Complex Systems and Adaptation. Glossary. MIT Press, 2000.
280+ definitions. http://mitpress.mit.edu/books/FLAOH/cbnhtml/glossary-intro.html

[IUPAC Combinatorial] International Union of Pure and Applied
Chemistry, Glossary of Terms Used in Combinatorial Chemistry, D. Maclean, J.J.
Baldwin, V.T. Ivanov, Y. Kato, A. Shaw, P. Schneider, and E.M. Gordon, Pure
Appl. Chem., Vol. 71, No. 12, pp. 2349-2365, 1999 http://www.iupac.org/reports/1999/7112maclean/

[IUPAC Computational] International Union of Pure and Applied Chemistry,
Glossary of Terms used in Computational Drug Design, H. van de Waterbeemd, R.E.
Carter, G. Grassy, H. Kubinyi, Y.C. Martin, M.S. Tute, P. Willett, 1997. 125+
definitions. http://www.iupac.org/reports/1997/6905vandewaterbeemd/glossary.html

[NIST] National Institute of Standards and Technology, Dictionary of
Algorithms, Data Structures and Problems, Paul Black, 2001, 1300+ terms http://hissa.nist.gov/dads/terms.html

[Statsoft, Inc.] Statistics glossary, Electronic Statistics Textbook, Tulsa
OK, US 2001 http://www.statsoft.com/textbook/stathome.html

[Tollenaere] JP Tollenaere, EE Moret, Hyperglossary of [Molecular Modelling in Drug Design] Terminology,
Utrecht University, 1996. 150+ definitions. http://wwwcmc.pharm.uu.nl/webcmc/glossary.html

In-depth Algorithms glossary

affinity based data mining: Large and complex data sets are analyzed
across multiple dimensions, and the data mining system identifies data
points or sets that tend to be grouped together. These systems differentiate
themselves by providing hierarchies of associations and showing any underlying
logical conditions or rules that account for the specific groupings of
data. This approach is particularly useful in biological motif analysis.
["Data mining" Nature Biotechnology 18: 237-238 Supp. Oct. 2000] Broader term data
mining.

artificial neural nets: Algorithms that simulate the functioning
of human neurons and may be used for pattern recognition problems,
e.g., to establish quantitative structure-activity relationships.
[IUPAC Computational] Broader term neural nets. Related term Drug
discovery and development drug design.

Bayesian inference algorithms: Sequence alignment without gap
penalties or selection of a scoring matrix is just one product of a full
Bayesian approach to bioinformatics. Other products include the following:
1) exact significance measures; 2) explicit elucidation of variation in
conservation at different points in the sequences; 3) the exact probability
of the best alignment as a measure of its merit. Furthermore, since essentially
any of the dynamic programming algorithms used in bioinformatics can be
converted into a Bayesian equivalent, similar advantages are accessible
for a broad range of bioinformatics problems. [Intelligent Systems in Molecular
Biology 1998 Montreal] http://www-lbit.iro.umontreal.ca/ISMB98/tutorials/baysian-tut.html
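As a hedged illustration of item 3 in the definition above (the probability of the best alignment as a measure of its merit), the posterior probability of an alignment can be written with Bayes' theorem. The symbols are assumptions introduced here and are not notation from the cited tutorial: x and y are the two sequences, A is a candidate alignment, and the sum runs over all alignments A' under consideration.

```latex
P(A \mid x, y) \;=\; \frac{P(x, y \mid A)\, P(A)}{\sum_{A'} P(x, y \mid A')\, P(A')}
```

Under this reading, the best alignment is the A that maximizes P(A | x, y), and the value of that posterior is itself the measure of merit.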
Related terms Bioinformatics glossary,
Sequencing glossary.

ClogP values: Calculated 1-octanol/water partition coefficients,
frequently used in Structure-Property Correlation (SPC) or
quantitative structure-activity relationship (QSAR) studies
(Leo, 1993). [IUPAC Computational] Logarithm of the partition coefficient.

cellular automata: (CA) Cellular Automata are simply finite state
cells based in an N-dimensional world. Famous examples of CAs are Conway's
Life and Wolfram's 1D CA set. Cellular automata normally follow relatively
simple sets of rules but have some incredibly complicated behaviour. Before he died, John von
Neumann worked on a self-replicating and highly complex CA that required 29
states. CAs can be used to simulate life on a very abstract
plane. In fact, it has been found that CAs can be accurately used to model
traffic jams and other human-related phenomena. [Generation5,
"Artificial Intelligence Glossary"] http://www.generation5.org/glossary/c.shtml

comparative data mining: Focuses on overlaying large and complex
data sets that are similar to each other ... particularly useful in all
forms of clinical trial meta-analyses ... Here the emphasis is on
finding dissimilarities, not similarities. ["Data mining" Nature Biotechnology
Vol. 18: 237-238 Supp. Oct. 2000] Broader term data mining.

Comparative Molecular Field Analysis (CoMFA): A 3D-QSAR method
that uses statistical correlation techniques for the analysis of the quantitative
relationship between the biological activity of a set of compounds with
a specified alignment, and their three-dimensional electronic and steric
properties. Other properties, such as hydrophobicity and H-bonding,
can also be incorporated into the analysis (Cramer et al., 1988; Kubinyi,
1993b). [IUPAC Computational]

decision trees: Segregates the data based on values of the variables.
This methodology uses a hierarchy of if-then statements to classify data.
The major advantage of this application is that it is faster and more understandable
than neural nets. However, the major drawback is that data type has to
be interval or categorical. Continuous data will then have to be recoded
into these two data types, thus raising the possibility of concealing
significant breakpoints in the data. [Knowledge Discovery in Databases
course, Univ. Arizona, Nov. 1998] http://misdb.bpa.arizona.edu/~mis696g/Reports/DataMining/report1.htm#_Toc433470236

expert systems: Attempt to capture knowledge pertinent
to a specific problem. Techniques exist for helping to extract knowledge
from experts. One such method is the induction of rules from expert-generated
examples of problem solutions. This method differs from discovery in databases
in that the expert examples are usually of much higher quality than the
data in databases, and they usually cover only the important cases. Furthermore,
experts are available to confirm the validity and usefulness of the discovered
patterns. [Knowledge Discovery in Databases course, Univ. Arizona, Nov.
1998] http://misdb.bpa.arizona.edu/~mis696g/Reports/DataMining/report1.htm#_Toc433470236 A computer-based program that encodes rules obtained from process experts,
usually in the form of “if-then” statements. [J Glassey et al.
“Issues in the development of an industrial bioprocess advisory system”
Trends in Biotechnology 18 (4):136-41 April 2000] Related term artificial intelligence.

factorial design (FD): An experimental design technique in which each
variable (factor or descriptor) is investigated at fixed levels.
In a two-level FD, each variable can take two values, e.g., high and low
lipophilicity. [IUPAC Computational]

fuzzy: In contrast to binary (true/false) terms, allows for looser
boundaries for sets or concepts.

fuzzy logic: A superset of conventional (Boolean) logic that
has been extended to handle the concept of partial truth: truth values
between “completely true” and “completely false”. Introduced by Dr.
Lotfi Zadeh (Univ. of California - Berkeley) in the 1960s as a means to model the uncertainty
of natural language. [AI FAQ, Carnegie Mellon University Computer Science
Department] http://www.cs.cmu.edu/Groups/AI/html/faqs/ai/fuzzy/part1/faq-doc-2.html

Hansch analysis: The investigation of the quantitative relationship
between the biological activity of a series of compounds and their
physicochemical substituent or global parameters representing hydrophobic,
electronic, steric and other effects using multiple regression correlation
methodology. [IUPAC Medicinal Chemistry] Related term: QSAR.

heuristic algorithm: A programming strategy for solving
computationally resistant problems that utilizes self-educating techniques
(i.e., feedback evaluation) to improve performance (e.g., FASTA). Problem
solving by such experimental, trial-and-error methods does not guarantee
the optimal solution. [labvelocity.com]

influence based data mining: Complex and granular (as opposed
to linear) data in large databases are scanned for influences between specific
data sets, and this is done along many dimensions and in multi-table formats.
These systems find applications wherever there are significant cause and
effect relationships between data sets - as occurs, for example in large
and multivariant gene expression studies, which are behind areas such as
pharmacogenomics.
["Data mining" Nature Biotechnology Vol. 18: 237-238 Supp. Oct. 2000] Broader
term data mining k-means clustering: This non-hierarchical method initially takes the
number of components of the population equal to the final required number of
clusters. In this step itself the final required number of clusters is chosen
such that the points are mutually farthest apart. Next, it examines each
component in the population and assigns it to one of the clusters depending on
the minimum distance. The centroid's position is recalculated every time a
component is added to the cluster and this continues until all the components
are grouped into the final required number of clusters. [Amar B. Rau et al.
"K-means clustering algorithm" Hypertext Learning Center, Center for
the New Engineer, George Mason Univ.] http://cne.gmu.edu/modules/dau/stat/clustgalgs/clust5_bdy.html Broader terms cluster analysis, neural nets

molecular pattern recognition: Developing computational methodologies
for the analysis and interpretation of large-scale expression datasets
generated by DNA microarray experiments. Analysis of genome-wide
expression patterns and their correlations with phenotypes of interest
may provide unique insights into the structure of genetic networks and
into biological processes not yet understood at the molecular level.
[Whitehead/MIT [US] Genome Center's Molecular Pattern Recognition
web site.] http://www.genome.wi.mit.edu/MPR/index.html
Broader term pattern recognition. Related terms Expression glossary

Molecular pattern recognition links
Molecular Pattern Recognition links, Whitehead Institute, MIT, US http://www.genome.wi.mit.edu/MPR/links.html Human and model organisms.
Molecular Pattern Recognition group projects, Michael Gribskov’s homepage, San Diego Supercomputer Center, US. http://www.sdsc.edu/~gribskov/gribskov.html

mosaic plots: A graphical alternative for qualitative, or categorical,
data … display cross-classified data by constructing rectangles of area
proportional to the counts … likely to become more familiar [to scientists]
and their use is likely to grow. Are to categorical variables what scatterplots
are to continuous variables, and their purpose is the same, to find interesting
patterns of association between variables. [RD Meyer & D Book “Visualization
of data” Current Opinion in Biotechnology 11:89-96, 2000]

multivariate statistics: A set of statistical tools to analyze
data (e.g., chemical and biological) matrices using regression and/or pattern
recognition techniques. [IUPAC Computational]

Partial Least Squares PLS: Projection to latent structures
(PLS) is a robust multivariate generalized regression method using projections
to summarize multitudes of potentially collinear variables (Wold et al.,
1993). [IUPAC Computational]

predictive data mining: Combines pattern matching, influence
relationships, time set correlations, and dissimilarity analysis to offer
simulations of future data sets ... these systems are capable of incorporating
entire data sets into their working, and not just samples, which makes their
accuracy significantly higher ... used often in clinical trial analysis
and in structure-function correlations. ["Data mining" Nature Biotechnology
Vol. 18: 237-238 Supp. Oct. 2000] Broader term data mining

Principal Components Analysis PCA: Computational approach to
reducing the complexity of, for example, a set of descriptors, by identifying
those features which provide the major contributions to observed properties,
and thus reducing the dimensionality of the relevant property space. [IUPAC
Combinatorial Chemistry] A data reduction method using mathematical techniques to identify patterns
in a data matrix. The main element of this approach consists of the construction
of a small set of new orthogonal, i.e., non-correlated, variables derived
from a linear combination of the original variables. [IUPAC Computational] Often confused with common factor analysis. [Neural Network FAQ Part 1] ftp://ftp.sas.com/pub/neural/FAQ.html

recursive partitioning: Process for identifying complex structure-activity
relationships in large sets by dividing compounds into
a hierarchy of smaller and more homogeneous sub-groups on the basis of
the statistically most significant descriptors. Related terms clustering,
principal components analysis. [IUPAC Combinatorial Chemistry]

SIMCA (SIMple Classification Analysis or Soft Independent Modeling
of Class Analogy): This method is a pattern recognition and
classification technique (Dunn and Wold, 1995). [IUPAC Computational]

time delay data mining: The data is collected over time and systems
are designed to look for patterns that are confirmed or rejected as the
data set increases and becomes more robust. This approach is geared
toward long-term clinical trial analysis and multicomponent mode of action
studies. ["Data mining" Nature Biotechnology Vol. 18: 237-238 Supp. Oct.
2000] Broader term data mining

trends-based data mining: Software analyzes large and complex
data sets in terms of any changes that occur in specific data sets over
time. Data sets can be user-defined or the system can uncover them
itself ... This is especially important in cause-and-effect biological experiments.
Screening is a good example. ["Data mining" Nature Biotechnology Vol. 18:
237-238 Supp. Oct. 2000] Broader term data mining |
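
Illustrative sketch for the k-means clustering entry above: the procedure as that entry describes it (seed the clusters with mutually distant points, assign each remaining component to the nearest cluster, and recalculate that cluster's centroid after every addition) might look roughly like the following in Python. The function names farthest_seeds and sequential_kmeans, the greedy farthest-first seeding that starts from the first point, and the choice of Euclidean distance are assumptions made for illustration and are not taken from the cited source.

```python
import math


def farthest_seeds(points, k):
    """Greedily pick k seed indices that are far apart from one another.

    Assumption: start from the first point, then repeatedly add the point
    whose nearest already-chosen seed is as far away as possible. This only
    approximates "mutually farthest apart".
    """
    if not 1 <= k <= len(points):
        raise ValueError("k must be between 1 and the number of points")
    seeds = [0]
    while len(seeds) < k:
        candidates = [i for i in range(len(points)) if i not in seeds]
        nxt = max(
            candidates,
            key=lambda i: min(math.dist(points[i], points[s]) for s in seeds),
        )
        seeds.append(nxt)
    return seeds


def sequential_kmeans(points, k):
    """Single-pass k-means: update a centroid each time a point joins it."""
    seeds = farthest_seeds(points, k)
    centroids = [list(points[i]) for i in seeds]
    counts = [1] * k                      # each cluster starts with its seed
    labels = [None] * len(points)
    for cluster, i in enumerate(seeds):
        labels[i] = cluster

    for i, p in enumerate(points):
        if labels[i] is not None:         # seeds are already assigned
            continue
        # Assign the point to the cluster with the nearest centroid.
        c = min(range(k), key=lambda j: math.dist(p, centroids[j]))
        counts[c] += 1
        # Running-mean update of that centroid (math.dist needs Python 3.8+).
        centroids[c] = [m + (x - m) / counts[c] for m, x in zip(centroids[c], p)]
        labels[i] = c
    return labels, centroids


if __name__ == "__main__":
    # Hypothetical toy data: two well-separated groups of three points.
    data = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0),
            (10.0, 10.0), (10.0, 11.0), (11.0, 10.0)]
    labels, centroids = sequential_kmeans(data, 2)
    print(labels)
    print(centroids)
```

Because each point is visited once, the result depends on the order of the input. The more common batch variant (Lloyd's algorithm) instead repeats the assignment and centroid-update steps until the assignments stop changing.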