 Algorithms & data management glossary
Evolving terminology for emerging technologies
Suggestions? Comments? Questions? chitty@healthtech.com
Last revised December 26, 2001
 
 
With changes in sequencing technology and methods, the rate of acquisition of human and other genome data over the next few years will be ~100 times higher than originally anticipated. Assembling and interpreting these data will require new and emerging levels of coordination and collaboration in the genome research community to develop the necessary computing algorithms, data management and visualization system.  [Lawrence Berkeley Lab, US "Advanced Computational Structural Genomics"]  http://cbcg.lbl.gov/ssi-csb/Meso.html

Related glossaries include Applications: Drug Discovery & Development, Sequencing, Informatics: Bioinformatics, Chemoinformatics, Computers & computing, Molecular Modeling, Research Biology: Protein Structures, Sequences, DNA & beyond.  Additional definitions appear in the In-depth glossary, after the Bibliography.

3D-QSAR Three-Dimensional Quantitative Structure-Activity Relationships: Involves the analysis of the quantitative relationship between the biological activity of a set of compounds and their three-dimensional properties using statistical correlation methods. [IUPAC Computational]  Broader terms QSAR; Drug discovery & development  SAR Structure Activity Relationship  Narrower terms In-depth CoMFA Comparative Molecular Field Analysis Related term Drug discovery & development drug design

algorithm: A computable set of steps to achieve a desired result.

Note: The word comes from the Persian author Abu Ja'far Mohammed ibn Mûsâ al-Khowârizmî who wrote a book with arithmetic rules dating from about 825 A.D. [NIST]

Rules or a process, particularly in computer science. In medicine, a step-by-step process for reaching a diagnosis or ruling out specific diseases. May be expressed as a flow chart in either sense. Greater efficiencies in algorithms, as well as improvements in computer hardware, have led to advances in computational biology.

Narrower terms sequencing algorithms; In-depth Bayesian inference algorithms, genetic algorithm, heuristic algorithm.  Related terms heuristic, parsing; Sequencing dynamic programming methods.

annotation: Bioinformatics glossary.

artificial intelligence (AI): A wide-ranging term encompassing computer applications that have the ability to make decisions; the ability to explain reasoning is evidence of intelligence.  Also covers methods that have the ability to learn. [J Glassey et al. “Issues in the development of an industrial bioprocess advisory system” Trends in Biotechnology 18 (4):136-41 April 2000]

Or, as some people have noted, laboriously trying to get computers to do what people do intuitively, without great effort. Conversely, there are things computers can do (relatively) effortlessly, such as massive numbers of error-free calculations. The most promising applications seem to combine computer-aided consideration of many possibilities with human judgment.

Narrower terms In-depth cellular automata, expert systems, fuzzy logic, genetic algorithms, neural nets Related term training sets. 

Artificial Intelligence Links
American Association of Artificial Intelligence: Topics http://www.aaai.org/AITopics/html/current.html

Virtual Library Artificial Intelligence, David Corne, March 1997 http://www.u.arizona.edu/~avs/ACG/AI.html   University and government research sites, newsgroups, commercial sites and products, programming languages, journals, bibliographies, “interactive things” and other information

biometrics: The information age is quickly revolutionizing the way transactions are completed. Everyday actions are increasingly being handled electronically, instead of with pencil and paper or face to face. This growth in electronic transactions has resulted in a greater demand for fast and accurate user identification and authentication. Biometric technology is a way to achieve fast, user- friendly authentication with a high level of accuracy. [Biometrics Consortium]  http://www.biometrics.org/REPORTS/CTSTG96/

cluster analysis: The clustering, or grouping, of  large data sets (e.g., chemical and/ or pharmacological data sets) on the basis of similarity criteria for appropriately scaled  variables that represent the data of interest. Similarity criteria (distance based, associative, correlative, probabilistic) among the several clusters facilitate the recognition of patterns and reveal otherwise hidden structures (Rouvray, 1990; Willett, 1987, 1991). [IUPAC Computational]

This data-analysis approach uses standard statistical algorithms to arrange genes according to similarity in patterns of gene expression. The output is displayed graphically, conveying the clustering and the underlying gene expression data simultaneously. (Eisen MB, et al. "Cluster analysis and display of genome- wide expression patterns." Proceedings of the National Academy of Sciences, U.S.A. 1998;95:14863-14868.) Clusters, and the genes within them, can be examined for commonalities in function or sequence to help researchers better understand how and why they behave similarly. [CHI Microarrays]  

S. cerevisiae GenomeCluster Analysis and Display of Genome-wide Expression Patterns, Stanford Univ., US    http://rana.Stanford.EDU/clustering/   An online supplement to Mike Eisen’s 1998 PNAS article [above reference]

Has been used in medicine to create taxonomies of diseases and diagnoses, and in archaeology to establish taxonomies of stone tools and funerary objects.

Related terms hierarchical clustering, pattern recognition. Narrower terms k-means clustering, self-organizing maps

collaborative filtering:  Tools that leverage user preferences, patterns, and purchasing behavior to customize organization and navigation systems. [Peter Morville "Software for Information Architects" Argus Center for Information Architecture, 2000] http://argus-acia.com/strange_connections/current_article.html

Amazon's recommendations, based on what other buyers of a specific title have also purchased, are a familiar example of collaborative filtering.
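The "buyers of this title also bought" idea can be sketched in a few lines. This is a minimal illustration only, not how any retailer actually implements it; the purchase data, the buyer names, and the function name also_bought are all hypothetical.

```python
from collections import Counter

# Hypothetical purchase histories: buyer -> set of titles (illustrative data only).
PURCHASES = {
    "ann": {"Genomes", "Bioinformatics", "Perl Basics"},
    "bob": {"Genomes", "Bioinformatics"},
    "cal": {"Genomes", "Statistics"},
    "dee": {"Cooking"},
}

def also_bought(title, purchases, top_n=2):
    """Rank other titles by how many buyers of `title` also bought them."""
    counts = Counter()
    for basket in purchases.values():
        if title in basket:
            counts.update(basket - {title})
    # Sort by descending count, then alphabetically so ties are deterministic.
    ranked = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return [name for name, _ in ranked[:top_n]]

print(also_bought("Genomes", PURCHASES))  # ['Bioinformatics', 'Perl Basics']
```

Real systems typically weight by similarity between users or items rather than using raw co-purchase counts.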

common factor analysis: See under In-depth Principal Components Analysis PCA

comparative molecular field analysis (CoMFA): A 3D-QSAR method that uses statistical correlation techniques for the analysis of the quantitative relationship between the biological activity of a set of compounds with a specified alignment, and their three-dimensional electronic and steric properties. Other properties such as hydrophobicity and hydrogen bonding can also be incorporated into the analysis. (See also Three-dimensional Quantitative Structure-Activity Relationship [3D-QSAR]). [IUPAC Medicinal Chemistry]

data cleaning: Removal and/or correction of erroneous data introduced by data entry errors, expired validity of data, or by some other means. [Lawrence Berkeley Lab "Advanced Computational Structural Genomics" Glossary] http://cbcg.lbl.gov/ssi-csb/Meso.html#anchor597905

The quality of data in sequence databases is highly variable.  This is receiving increasing attention.  Ensembl (Bioinformatics In-depth) differentiates data of varying quality. 

Related term data reduction methods
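As a rough sketch of removal and correction in practice, the function below normalizes case and whitespace (correction) and rejects records with invalid characters, empty sequences, or repeated identifiers (removal). The record layout with 'id' and 'seq' keys, the allowed alphabet, and the function name clean_records are assumptions made for illustration; checks for expired validity are not covered.

```python
def clean_records(records, valid_bases="ACGTN"):
    """Return (cleaned, rejected) lists from raw sequence records.

    Each record is a dict with hypothetical keys 'id' and 'seq'.
    """
    cleaned, rejected, seen = [], [], set()
    for rec in records:
        seq = rec["seq"].strip().upper()          # correct case and whitespace errors
        bad = not seq or any(b not in valid_bases for b in seq)
        if bad or rec["id"] in seen:              # remove invalid or duplicate entries
            rejected.append(rec)
            continue
        seen.add(rec["id"])
        cleaned.append({"id": rec["id"], "seq": seq})
    return cleaned, rejected

raw = [
    {"id": "s1", "seq": " acgt \n"},   # fixable: stray whitespace, lower case
    {"id": "s2", "seq": "ACXT"},       # invalid character X
    {"id": "s1", "seq": "ACGT"},       # duplicate identifier
    {"id": "s3", "seq": ""},           # empty sequence
]
cleaned, rejected = clean_records(raw)
print(cleaned)
```

Keeping the rejected records, rather than silently dropping them, makes it possible to audit what the cleaning step removed.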

data integration: Related terms Bioinformatics interoperability, XML

data management methods: Include algorithms, artificial intelligence, data cleaning, data mining, data reduction methods, expert systems, factorial design, fuzzy logic, knowledge based systems, neural networks, normalization, parsing, pattern recognition, SPC Structure-Property Correlations, visualization and various statistical methods. In-depth CoMFA, decision trees, factorial design, mosaic plots, multivariate statistics, Partial Least Squares PLS, Principal Components Analysis PCA, recursive partitioning Clinical genomics glossary meta-analysis

data mart: See under data warehouse.

data mining: Nontrivial extraction of implicit, previously unknown and potentially useful information from data, or the search for relationships and global patterns that exist in databases. [Bob Klevecz "The Whole EST Catalog" Scientist 12 (2): 22 Jan 18 1999]

Exploration and analysis, by automatic or semi-automatic means, of large quantities of data in order to discover meaningful patterns or rules. [Berry, MJA, Data Mining Techniques for Marketing, Sales and Customer Support John Wiley & Sons, New York 1997 cited in Nature Genetics 21(15): 51-55 ref 11, 1999]

May need to incorporate related techniques such as cluster analysis or visualization.

Narrower terms In- depth affinity based data mining, comparative data mining, influence-based data mining, predictive data mining, text mining, time delay data mining, trends analysis data mining.  Imaging image data mining. Related terms data warehouse

data reduction methods: Includes cluster analysis, currently the best known data reduction method in the microarray field. [CHI Bioinformatics] 

Related term data cleaning

data warehouse: An integrated repository of data from multiple, possibly heterogeneous data sources, presented with consistent and coherent semantics. Warehouses usually contain summary information represented on a centralized storage facility. [Lawrence Berkeley Lab "Advanced Computational Structural Genomics" Glossary] http://cbcg.lbl.gov/ssi-csb/Meso.html#anchor597905

The term was coined by W. H. Inmon. ... Typically, a data warehouse is housed on an enterprise mainframe server. ... Data warehousing emphasizes the capture of data from diverse sources for useful analysis and access, but does not generally start from the point of view of the end user or knowledge worker who may need access to specialized, sometimes local databases. The latter idea is known as the data mart. [whatis.com]

Related terms data mining, global schema

evolutionary computation methods: Include genetic algorithms (GAs) or genetic programming (GPs) which may make it possible to discriminate between common infectious agents, monitor complex industrial bioprocesses, and detect specific chemical biomarkers in bacteria. [Roy Goodacre "Evolutionary Computation for Interpretation of Metabolomic Data" Metabolic Profiling Dec. 3-4, 2001 Chapel Hill, NC]

experimental design: The use of mathematical and statistical methods to select the minimum number of experiments or compounds for optimal coverage of descriptor or variable space.  [IUPAC Computational]

functional genomic data: Functional genomics glossary

genetic algorithm GA : Method for library design by evaluating the fit of a parent library to some desired property (e.g. the level of activity in a biological assay, or the computationally determined diversity of the compound set) as measured by a fitness function. The design of more optimal daughter libraries is then carried out by a heuristic process with similarities to genetic selection in that it employs replication, mutation, deletions etc. over a number of generations. [IUPAC Combinatorial Chemistry]

An optimization algorithm based on the mechanisms of Darwinian evolution which uses random mutation, crossover and selection procedures to breed better models or solutions from an originally random starting population or sample. (Rogers and Hopfinger, 1994). [IUPAC Computational]

Genetic Algorithms Archive http://www.aic.nrl.navy.mil/galist/

Related terms  Computers & computing evolutionary computation ; Drug discovery & development drug design  
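The mutation, crossover and selection cycle described above can be sketched on a toy problem. The example below is an illustration under stated assumptions, not a library-design tool: the fitness function (count of 1s in a bit string, often called OneMax), the population settings, and the names fitness and evolve are all chosen for demonstration.

```python
import random

def fitness(bits):
    """Toy fitness function: number of 1s in the bit string (OneMax)."""
    return sum(bits)

def evolve(n_bits=20, pop_size=30, generations=40, mutation_rate=0.02, seed=0):
    """Breed better bit strings from a random starting population."""
    rng = random.Random(seed)  # fixed seed so runs are repeatable
    pop = [[rng.randint(0, 1) for _ in range(n_bits)] for _ in range(pop_size)]
    history = []
    for _ in range(generations):
        pop.sort(key=fitness, reverse=True)
        history.append(fitness(pop[0]))
        parents = pop[: pop_size // 2]        # selection: keep the fitter half
        children = [pop[0][:]]                # elitism: best individual survives unchanged
        while len(children) < pop_size:
            a, b = rng.sample(parents, 2)
            cut = rng.randint(1, n_bits - 1)  # single-point crossover
            child = a[:cut] + b[cut:]
            # Mutation: flip each bit with a small probability.
            child = [1 - g if rng.random() < mutation_rate else g for g in child]
            children.append(child)
        pop = children
    best = max(pop, key=fitness)
    history.append(fitness(best))
    return best, history

best, history = evolve()
print(fitness(best), history[0])
```

Because the best individual is always copied into the next generation, the best fitness recorded in history never decreases; without elitism it could fluctuate.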

genome mining: In an initial data-mining effort, the draft human genome was searched to find paralogs of known tumor suppressor genes, and for gene arrangements that are typical of oncogenes in cancer cells. The results were disappointing, indicating that although knowledge of the human genome will undoubtedly be of great help, other approaches to identify new oncogenes are needed.  [TG Boyer et al. "Genome mining for human cancer genes: wherefore art thou?" Trends in Molecular Medicine 7 (5): 187-189, May 2001]  http://www.ncbi.nlm.nih.gov:80/entrez/query.fcgi?cmd=Retrieve&db=PubMed&list_uids=11325617&dopt=Abstract

genomic data: Genomics glossary

global schema: A schema, or a map of the data content of a data warehouse that integrates the schemata from several source repositories. It is "global", because it is presented to warehouse users as the schema that they can query against to find and relate information from any of the sources, or from the aggregate information in the warehouse. [Lawrence Berkeley Lab "Advanced Computational Structural Genomics" Glossary] Broader term schema. http://cbcg.lbl.gov/ssi-csb/Meso.html#anchor597905

heuristic: Tools such as genetic algorithms or neural networks employ heuristic methods to derive solutions which may be based on purely empirical information and which have no explicit rationalization. [IUPAC Combinatorial Chemistry] 

Trial and error methods.  Narrower term: In-depth heuristic algorithm

Hidden Markov Models HMM: Molecular modeling glossary

Related term simulated annealing

hierarchical clustering: An unsupervised clustering approach that has been proven valuable for recognizing patterns in the gene expression data. However, since the output of hierarchical clustering is a tree-like structure, rather than separate clusters as produced by self-organizing map (SOM), it is often arbitrary to determine whether two subtrees should belong to a single cluster. [Jinfeng Liu "Analysis of yeast microarray gene expression data: Hierarchical Clustering and Self-Organizing Map" Columbia Univ. Bioinformatics 2000] http://cubic.bioc.columbia.edu/~liu/Project2/proposal.html
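To show why the output is a tree rather than a set of clusters, here is a minimal sketch of bottom-up (agglomerative) clustering. It assumes single linkage and Euclidean distance, which are only one of several common choices; the function name hierarchical_cluster and the sample points are illustrative.

```python
import math

def hierarchical_cluster(points):
    """Agglomerative (bottom-up) clustering with single linkage.

    Returns the merge history as (members_a, members_b, distance) tuples;
    read in order, it describes the tree (dendrogram).
    """
    clusters = [[i] for i in range(len(points))]
    merges = []

    def linkage(a, b):
        # Single linkage: distance between the closest pair of members.
        return min(math.dist(points[i], points[j]) for i in a for j in b)

    while len(clusters) > 1:
        pairs = [(linkage(clusters[i], clusters[j]), i, j)
                 for i in range(len(clusters))
                 for j in range(i + 1, len(clusters))]
        d, i, j = min(pairs)                      # merge the two closest clusters
        merges.append((clusters[i], clusters[j], d))
        merged = clusters[i] + clusters[j]
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)] + [merged]
    return merges

points = [(0.0, 0.0), (0.0, 1.0), (5.0, 5.0), (5.0, 6.5)]
for a, b, d in hierarchical_cluster(points):
    print(a, b, round(d, 2))
```

Obtaining separate clusters requires cutting the tree at some distance threshold, and that choice is the arbitrary step the definition refers to.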

knowledge based systems: An extension of the expert system concept wherein additional forms of knowledge, such as mathematical models, are incorporated with the expert rules. [J Glassey et al. “Issues in the development of an industrial bioprocess advisory system” Trends in Biotechnology 18 (4):136-41 April 2000] Related term data mining.   

knowledge management: The ability to navigate through and analyze large amounts of data, and to ensure a flow of knowledge to the right people at the right time, is emerging as a major competitive advantage. This is especially critical as companies seek to exploit emerging technologies, coordinate research and activities across large organizational and geographic distances, and manage costs and projects effectively. Knowledge management systems are designed to capture much of the tacit capabilities of an organization, including the skills and experience of its employees. [CHI Summit Report, Transforming the Pharmaceutical Industry – The Industrialization of Research and New Market Realities ]

Knowledge Management link
Virtual Library: Knowledge Management, May 2000  http://www.brint.com/km/  Definition, articles, white papers, interviews, business and technology library,  periodicals and publications, “out of box thinking”,  “movers and shakers”, “think tank”, calendar of events, emerging topics.

lexical parsing: See under parsing.

neural networks: Technique for optimizing a desired property given a set of items which have been previously characterized with respect to that property (the 'training set'). Features of members of the training set which correlate with the desired property are 'remembered' and used to generate a model for selecting new items with the desired property or to predict the fit of an unknown member. [IUPAC Combinatorial Chemistry]

Communication between statisticians and neural net researchers is often hindered by the different terminology used in the two fields. There is a comparison of neural net and statistical jargon in ftp://ftp.sas.com/pub/neural/jargon  [Neural Network FAQ Part 1] ftp://ftp.sas.com/pub/neural/FAQ.html

Narrower term In-depth artificial neural networks. Often uses In-depth fuzzy logic. Related terms artificial intelligence; Molecular modeling glossary self- organizing maps   
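The role of the training set can be illustrated with the simplest possible network, a single threshold unit trained with the classic perceptron rule. This is a sketch under assumptions: the training data (the logical AND of two inputs), the learning rate, and the names train_perceptron and predict are chosen for illustration, and practical networks use many units and different training procedures.

```python
def train_perceptron(training_set, epochs=20, lr=1.0):
    """Fit the weights and bias of a single threshold unit with the perceptron rule."""
    n = len(training_set[0][0])
    w, b = [0.0] * n, 0.0
    for _ in range(epochs):
        for x, target in training_set:
            out = 1 if sum(wi * xi for wi, xi in zip(w, x)) + b > 0 else 0
            err = target - out                    # zero when the item is already classified correctly
            w = [wi + lr * err * xi for wi, xi in zip(w, x)]
            b += lr * err
    return w, b

def predict(w, b, x):
    """Apply the trained unit to a new item."""
    return 1 if sum(wi * xi for wi, xi in zip(w, x)) + b > 0 else 0

# Training set: items already characterized with respect to the property (here, logical AND).
AND_SET = [((0, 0), 0), ((0, 1), 0), ((1, 0), 0), ((1, 1), 1)]
w, b = train_perceptron(AND_SET)
print([predict(w, b, x) for x, _ in AND_SET])  # [0, 0, 0, 1]
```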

normalization: In creating a database, normalization is the process of organizing it into tables in such a way that the results of using the database are always unambiguous and as intended. Normalization may have the effect of duplicating data within the database and often results in the creation of additional tables. (While normalization tends to increase the duplication of data, it does not introduce redundancy, which is unnecessary duplication.) Normalization is typically a refinement process after the initial exercise of identifying the data objects that should be in the database, identifying their relationships, and defining the tables required and the columns within each table. [whatis.com] 
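A small sketch of the idea, using plain Python structures in place of database tables: a flat table that repeats each lab's city on every row is split into a samples table and a labs table, so each city is stored once and cannot contradict itself. The column names, the sample data, and the function name normalize are hypothetical.

```python
def normalize(flat_rows):
    """Split flat (sample, lab, city) rows into a samples table and a labs table."""
    labs, samples = {}, []
    for sample_id, lab, city in flat_rows:
        if lab in labs and labs[lab] != city:
            # The flat table allowed this ambiguity; the normalized form exposes it.
            raise ValueError(f"conflicting city for lab {lab!r}")
        labs[lab] = city                      # each lab's city is stored once
        samples.append((sample_id, lab))      # samples refer to labs by key
    return samples, labs

rows = [
    ("s1", "Lab A", "Boston"),
    ("s2", "Lab A", "Boston"),
    ("s3", "Lab B", "Denver"),
]
samples, labs = normalize(rows)
print(samples)
print(labs)
```

The lab name now appears in both tables as a key, which is the kind of deliberate duplication the definition distinguishes from redundancy.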

parsing: Using algorithms to analyze data into components. Semantic parsing involves trying to figure out what the components mean. Lexical parsing refers to the process of deconstructing the data into components.
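The two stages can be sketched separately. In this toy example the lexical step breaks a string into number and word components, and the semantic step interprets each number-word pair as a quantity with a unit. The input format and the function names lex and interpret are assumptions for illustration.

```python
import re

def lex(text):
    """Lexical parsing: break the raw string into its components."""
    return re.findall(r"\d+(?:\.\d+)?|[A-Za-z]+", text)

def interpret(tokens):
    """Semantic parsing: decide what the components mean (toy rule: number, then unit)."""
    quantities = []
    for value, unit in zip(tokens[0::2], tokens[1::2]):
        quantities.append((float(value), unit))
    return quantities

tokens = lex("10 mg, 2.5 ml")
print(tokens)             # ['10', 'mg', '2.5', 'ml']
print(interpret(tokens))  # [(10.0, 'mg'), (2.5, 'ml')]
```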

pattern recognition (PR): The identification of patterns in large data sets using appropriate mathematical methodologies.  Examples are principal component  analysis (PCA), SIMCA, partial least squares (PLS) and artificial neural  networks (ANN) (Rouvray, 1990; Van de Waterbeemd, 1995ab) [IUPAC  Computational] Narrower terms In- depth artificial neural  networks, molecular pattern recognition, principal component  analysis (PCA), SIMCA, partial least squares (PLS)

probability: Probability web http://www.mathcs.carleton.edu/probweb/probweb.html

protein and mRNA data: Proteomics glossary

Quantitative Structure-Activity Relationships QSAR: Mathematical relationships linking chemical structure and pharmacological activity in a quantitative manner for a series of compounds. Methods which can be used in QSAR include various regression and pattern recognition techniques. QSAR is often taken to be equivalent to chemometrics or multivariate statistical data analysis.  It is sometimes used in a more limited sense as equivalent to Hansch analysis. QSAR is a subset of the more general term SPC.  [IUPAC Computational]

The building of structure – biological activity models by using regression analysis with physicochemical constants, indicator variables or theoretical calculations. The term has been extended by some authors to include chemical reactivity, i.e. activity is regarded as synonymous with reactivity. This extension is, however, discouraged. Related term correlation analysis. [IUPAC Compendium]

Related terms SAR Structure Activity Relationship; In-depth Hansch analysis; Drug discovery & development drug design

regression analysis: The use of statistical  methods for modeling a set of dependent variables, Y, in terms of combinations of  predictors, X. It includes methods such as multiple linear regression (MLR) and partial least squares (PLS). [IUPAC Computational]
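For the simplest case, one dependent variable and one predictor, ordinary least squares has a closed-form solution. The sketch below covers only that case; multiple linear regression and PLS need matrix methods. The function name fit_line and the sample data are illustrative.

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = intercept + slope * x (one predictor)."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)                       # spread of the predictor
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) # co-variation of X and Y
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope

intercept, slope = fit_line([0, 1, 2, 3], [1, 3, 5, 7])
print(intercept, slope)  # 1.0 2.0
```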

regression to the mean: A common misconception about genetics is an overgeneralization about how much quality can be improved by selective breeding. Two very tall parents will tend to produce offspring who are taller than the population average, but less tall than the average of the parents' heights. George Bernard Shaw is supposed to have made the same point to a famous beauty who suggested they have a child: "With your brains and my looks ..." He retorted, "But what if the child had my looks and your brains?"

robust: A statistical test that yields approximately correct results despite the falsity of certain of the assumptions on which it is based. [OED] Hence, can refer to a process that is relatively insensitive to human foibles and to variation in the way it (for example, an assay) is carried out.

SAR Structure Activity Relationship: Drug discovery & development Narrower terms 3D-QSAR, QSAR

SPC Structure-Property Correlations: All statistical mathematical methods used to correlate any molecular property (intrinsic, chemical or biological) to any other property, using statistical regression or pattern recognition techniques (Van de Waterbeemd, 1992). QSAR is a subset of the more general term SPC.  [IUPAC Computational]

Narrower terms: 3D QSAR, QSAR

schema: Bioinformatics glossary  Narrower term In-depth global schema

self-organizing map: A type of mathematical cluster analysis that is particularly well suited for recognizing and classifying features in complex, multidimensional data. The method has been implemented in a publicly available computer package, GENECLUSTER, that performs the analytical calculations and provides easy data visualization … Expression patterns of some 6,000 human genes were assayed, and an online database was created. GENECLUSTER was used to organize the genes into biologically relevant clusters that suggest novel hypotheses about hematopoietic differentiation. [P. Tamayo et al. “Interpreting patterns of gene expression with self-organizing maps: methods and application to hematopoietic differentiation” PNAS 96(6):2907-12 Mar 16, 1999]

self organization: A process where the organization (constraint, redundancy) of a system spontaneously increases, i.e. without this increase being controlled by the environment or an encompassing or otherwise external system. [F. Heylighen, "Self Organization" Jan 27, 1997 in: F. Heylighen, C. Joslyn and V. Turchin (editors): Principia Cybernetica Web (Principia Cybernetica, Brussels)] http://pespmc1.vub.ac.be/SELFORG.html

Related term neural networks

semantic parsing: See under parsing.

sequencing algorithms: See BLAST, FASTA, Needleman - Wunsch, Smith - Waterman Sequencing Glossary  In-depth

stochastic: "Aiming, proceeding by guesswork" (Webster's Collegiate Dictionary). Term which is often applied to combinatorial processes involving true random sampling, such as selection of beads from an encoded library, or certain methods for library design. [IUPAC Combinatorial Chemistry]

Truly random, based on probability.

text mining: Using data mining on unstructured data, such as the biomedical literature.  Related term Computers & computing natural language processing  

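training set: A set of example items that have already been characterized with respect to the property of interest, from which a model or set of rules is derived. Related term neural networks.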

visualization: Among the most significant unmet needs in bioinformatics are for improved visualization and data- mining software. Now that researchers are regularly dealing with hundreds of thousands to millions of data points, visualization is critical. But to mine genomic data effectively, such tools will need to be married to sophisticated analysis packages that employ advanced statistical techniques. [CHI Bioinformatics]

Use of computer-generated graphics to make the information more accessible and interactive. Related term data mining

Visualization in Bioinformatics link, Alan Robinson, EBI, UK http://industry.ebi.ac.uk/~alan/VisSupp/

visualisation tools:  Anything from visual … starting points for navigation of data to digestions of data into graphical representations of the results. There are an increasing number of tools being developed of both generic use (rule, tree, map and other graphing visualisers) and for bioinformatics (genome browsers, 3D viewers, sequence searching filters, etc.). Very few of these tools are capable of exploiting multiple databases. [A Robinson “About Visualisation” EBI, UK Mar 2000]   http://industry.ebi.ac.uk/~alan/VisSupp/AboutVisSupp.html

IUPAC definitions are reprinted with the permission of the International Union of Pure and Applied Chemistry.

Bibliography

[Flake] Gary Computational Beauty of Nature: Computer Explorations of Fractals, Chaos, Complex Systems and Adaptation. Glossary MIT Press, 2000. 280+ definitions. http://mitpress.mit.edu/books/FLAOH/cbnhtml/glossary-intro.html

[IUPAC Combinatorial] International Union of Pure and Applied Chemistry, Glossary of Terms Used in Combinatorial Chemistry, D. Maclean, J.J. Baldwin, V.T. Ivanov, Y. Kato, A. Shaw, P. Schneider, and E.M. Gordon, Pure Appl. Chem., Vol. 71, No. 12, pp. 2349-2365, 1999  http://www.iupac.org/reports/1999/7112maclean/

[IUPAC Computational] International Union of Pure and Applied Chemistry, Glossary of Terms used in Computational Drug Design, H. van de Waterbeemd, R.E. Carter, G. Grassy, H. Kubinyi, Y.C. Martin, M.S. Tute, P. Willett, 1997. 125+ definitions. http://www.iupac.org/reports/1997/6905vandewaterbeemd/glossary.html

[NIST] National Institute of Standards and Technology, Dictionary of Algorithms, Data Structures and Problems, Paul Black, 2001, 1300+  terms  http://hissa.nist.gov/dads/terms.html

[Statsoft, Inc.] Statistics glossary, Electronic Statistics Textbook, Tulsa OK, US 2001 http://www.statsoft.com/textbook/stathome.html

[Tollenaere] JP, EE Moret, Hyperglossary of [Molecular Modelling in Drug Design] Terminology, Utrecht University, 1996. 150+ definitions. http://wwwcmc.pharm.uu.nl/webcmc/glossary.html


In-depth Algorithms glossary

affinity based data mining: Large and complex data sets are analyzed across multiple dimensions, and the data mining system identifies data points or sets that tend to be grouped together.  These systems differentiate themselves by providing hierarchies of associations and showing any underlying logical conditions or rules that account for the specific groupings of data.  This approach is particularly useful in biological motif analysis. ["Data mining" Nature Biotechnology 18: 237-238 Supp. Oct. 2000] Broader term data mining 

artificial neural nets: Algorithms that simulate the functioning of human neurons and may be used for pattern recognition problems, e.g., to establish quantitative structure-activity relationships. [IUPAC Computational] Broader term neural nets Related term Drug discovery and development drug design

Bayesian inference algorithms: Sequence alignment without gap penalties or selection of a scoring matrix is just one product of a full Bayesian approach to bioinformatics. Other products include the following: 1) exact significance measures; 2) explicit elucidation of variation in conservation at different points in the sequences; 3) the exact probability of the best alignment as a measure of its merit. Furthermore, since essentially any of the dynamic programming algorithms used in bioinformatics can be converted into a Bayesian equivalent similar advantages are accessible for a broad range of bioinformatics problems. [Intelligent Systems in Molecular Biology 1998 Montreal]  http://www-lbit.iro.umontreal.ca/ISMB98/tutorials/baysian-tut.html  Related terms Bioinformatics glossary, Sequencing glossary.

ClogP values: Calculated 1-octanol/ water partition coefficients, frequently used in   Structure-Property Correlation (SPC) or quantitative structure-activity relationship (QSAR) studies (Leo, 1993).  [IUPAC Computational]

Logarithm of the partition coefficient.

cellular automata (CA): Cellular automata are simply finite state cells based in an N-dimensional world. Famous examples of CAs are Conway's Life and Wolfram's 1D CA set. Cellular automata normally follow relatively simple sets of rules but have some incredibly complicated behaviour. Before he died, John von Neumann worked on a self-replicating and highly complex CA that required 29 states. CAs can be used to simulate life on a very abstract plane. In fact, it has been found that CAs can be accurately used to model traffic jams and other human-related phenomena.  [Generation5, "Artificial Intelligence Glossary"]  http://www.generation5.org/glossary/c.shtml
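A one-dimensional, two-state automaton of the kind Wolfram catalogued is short enough to sketch in full. Each cell's next state depends only on itself and its two neighbours, and the rule number (0 to 255) encodes the outcome for each of the eight possible neighbourhoods. The function name step and the fixed zero boundary are choices made for this illustration.

```python
def step(cells, rule=30):
    """Advance a 1D, two-state cellular automaton by one generation.

    `rule` is a Wolfram rule number (0-255); cells beyond the edges are treated as 0.
    """
    padded = [0] + cells + [0]
    new = []
    for i in range(1, len(padded) - 1):
        # Read left, centre, right as a 3-bit number, then look up that bit of the rule.
        neighbourhood = (padded[i - 1] << 2) | (padded[i] << 1) | padded[i + 1]
        new.append((rule >> neighbourhood) & 1)
    return new

row = [0, 0, 0, 1, 0, 0, 0]
for _ in range(4):
    print("".join("#" if c else "." for c in row))
    row = step(row, rule=90)
```

Even with rules this simple, repeated application can produce intricate patterns, which is the behaviour the definition describes.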

comparative data mining: Focuses on overlaying large and complex data sets that are similar to each other ... particularly useful in all forms of clinical trial meta-analyses ... Here the emphasis is on finding dissimilarities, not similarities. ["Data mining" Nature Biotechnology Vol. 18: 237-238 Supp. Oct. 2000] Broader term data mining

Comparative Molecular Field Analysis (CoMFA): A 3D-QSAR method that uses statistical correlation techniques for the analysis of the quantitative relationship between the biological activity of a set of compounds with a specified alignment, and their three- dimensional electronic and steric properties. Other properties, such as  hydrophobicity and H-bonding can also be incorporated into the analysis (Cramer et al., 1988; Kubinyi, 1993b).  [IUPAC Computational]

decision trees: Segregates the data based on values of the variables. This methodology uses a hierarchy of if-then statements to classify data. The major advantage of this application is that it is faster and more understandable than neural nets. However, the major drawback is that data type has to be interval or categorical. Continuous data will then have to be recoded into these two data types, thus bringing out the possibility of concealing significant breakpoints in the data. [Knowledge Discovery in Databases course, Univ. Arizona, Nov. 1998]   http://misdb.bpa.arizona.edu/~mis696g/Reports/DataMining/report1.htm#_Toc433470236
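A hand-written tree makes the "hierarchy of if-then statements" concrete. In this sketch a continuous measurement is first recoded into a category, then a two-level tree assigns a label. The variables, the threshold of 2.0, the labels, and the function names recode and classify are all hypothetical; real trees are usually learned from data rather than written by hand.

```python
def recode(expression_ratio, threshold=2.0):
    """Recode a continuous measurement into a categorical level (threshold is illustrative)."""
    return "high" if expression_ratio >= threshold else "low"

def classify(sample):
    """Walk a small hierarchy of if-then statements to label one sample."""
    if recode(sample["expression_ratio"]) == "high":
        if sample["tissue"] == "tumor":
            return "candidate"
        return "background"
    return "ignore"

print(classify({"expression_ratio": 3.1, "tissue": "tumor"}))   # candidate
print(classify({"expression_ratio": 1.99, "tissue": "tumor"}))  # ignore
```

The second call shows the drawback noted in the definition: a value just under the chosen threshold is treated the same as a value far below it.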

expert systems:  Attempt to capture knowledge pertinent to a specific problem. Techniques exist for helping to extract knowledge from experts. One such method is the induction of rules from expert- generated examples of problem solutions. This method differs from discovery in databases in that the expert examples are usually of much higher quality than the data in databases, and they usually cover only the important cases. Furthermore, experts are available to confirm the validity and usefulness of the discovered patterns. [Knowledge Discovery in Databases course, Univ. Arizona, Nov. 1998] http://misdb.bpa.arizona.edu/~mis696g/Reports/DataMining/report1.htm#_Toc433470236

A computer-based program that encodes rules obtained from process experts usually in the form of  “if - then” statements. [J Glassey et al. “Issues in the development of an industrial bioprocess advisory system” Trends in Biotechnology 18 (4):136-41 April 2000]

Related term artificial intelligence.

factorial design FD: An experimental design technique in which each variable (factor or  descriptor) is investigated at fixed levels. In a two- level FD, each variable can take two values, e.g., high and low lipophilicity.  [IUPAC Computational]
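A full factorial design simply enumerates every combination of the fixed levels. The sketch below uses the standard library's itertools.product; the factor names, the levels, and the function name full_factorial are examples only.

```python
from itertools import product

def full_factorial(levels):
    """Enumerate every combination of factor levels (a full factorial design)."""
    names = list(levels)
    return [dict(zip(names, combo))
            for combo in product(*(levels[n] for n in names))]

design = full_factorial({
    "lipophilicity": ["low", "high"],
    "size": ["small", "large"],
    "charge": ["neutral", "charged"],
})
print(len(design))  # 8 runs for three two-level factors (2 ** 3)
print(design[0])
```

The number of runs grows as the product of the level counts, which is why fractional designs are often used when there are many factors.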

fuzzy: In contrast to binary (true/false) terms, allows for looser boundaries for sets or concepts.

fuzzy logic: A superset of conventional (Boolean) logic that has been extended to handle the concept of partial truth: truth values between “completely true” and “completely false”. Introduced by Dr. Lotfi Zadeh (Univ. of California - Berkeley) in the 1960s as a means to model the uncertainty of natural language. [AI FAQ, Carnegie Mellon University Computer Science Department] http://www.cs.cmu.edu/Groups/AI/html/faqs/ai/fuzzy/part1/faq-doc-2.html
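Partial truth can be sketched with a membership function that returns a value between 0 and 1, combined using the minimum, maximum and complement operators commonly associated with Zadeh's formulation. The membership function tall and its 160 cm and 190 cm breakpoints are hypothetical, chosen only to make the example concrete.

```python
def tall(height_cm):
    """Hypothetical membership function: 0 below 160 cm, 1 above 190 cm, linear between."""
    return min(1.0, max(0.0, (height_cm - 160) / 30))

def fuzzy_and(a, b):
    return min(a, b)

def fuzzy_or(a, b):
    return max(a, b)

def fuzzy_not(a):
    return 1.0 - a

print(tall(175))                  # 0.5, i.e. "somewhat tall"
print(fuzzy_and(tall(175), 0.8))  # 0.5
print(fuzzy_not(tall(200)))       # 0.0
```

When every input is exactly 0 or 1 these operators give the same results as Boolean AND, OR and NOT, which is the sense in which fuzzy logic is a superset of conventional logic.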

Hansch analysis: The investigation of the quantitative relationship between the biological activity of a series of compounds and their physicochemical substituent or global parameters representing hydrophobic, electronic, steric and other effects using multiple regression correlation methodology. [IUPAC Medicinal Chemistry]

Related term: QSAR

heuristic algorithm:  A programming strategy for solving computationally resistant problems that utilizes self-educating techniques (i.e., feedback evaluation) to improve performance (e.g., FASTA). Problem solving by such experimental,  trial- and- error methods does not guarantee the optimal solution. [labvelocity.com]

influence based data mining: Complex and granular (as opposed to linear) data in large databases are scanned for influences between specific data sets, and this is done along many dimensions and in multi- table formats.  These systems find applications wherever there are significant cause and effect relationships between data sets - as occurs, for example in large and multivariant gene expression studies, which are behind areas such as pharmacogenomics. ["Data mining" Nature Biotechnology Vol. 18: 237-238 Supp. Oct. 2000] Broader term data mining

k-means clustering: This non-hierarchical method begins by choosing as many initial points from the population as the final required number of clusters, selected so that they are mutually farthest apart. Next, it examines each component in the population and assigns it to the nearest cluster (minimum distance). The centroid's position is recalculated every time a component is added to the cluster, and this continues until all the components are grouped into the final required number of clusters. [Amar B. Rau et al. "K-means clustering algorithm" Hypertext Learning Center, Center for the New Engineer, George Mason Univ.] http://cne.gmu.edu/modules/dau/stat/clustgalgs/clust5_bdy.html

Broader terms cluster analysis, neural nets
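
Illustration: a minimal sketch of the single-pass procedure described in the k-means clustering entry, assuming Euclidean distance and a greedy "farthest-first" choice of the initial points (one possible reading of "mutually farthest apart"; the source does not specify the rule). The function names and the data are invented for this example. Commonly used k-means implementations go further and repeat the assignment step until the clusters stop changing.

```python
import math


def farthest_first_seeds(points, k):
    """Pick k seed indices greedily: start at point 0, then repeatedly add
    the point whose nearest existing seed is farthest away (an assumption)."""
    seeds = [0]
    while len(seeds) < k:
        best = max(
            (i for i in range(len(points)) if i not in seeds),
            key=lambda i: min(math.dist(points[i], points[s]) for s in seeds),
        )
        seeds.append(best)
    return seeds


def sequential_kmeans(points, k):
    """One pass: seed k clusters, then add each remaining point to the
    nearest centroid and recompute that centroid immediately."""
    seeds = farthest_first_seeds(points, k)
    centroids = [list(points[s]) for s in seeds]
    counts = [1] * k
    labels = [None] * len(points)
    for cluster, s in enumerate(seeds):
        labels[s] = cluster
    for i, p in enumerate(points):
        if labels[i] is not None:
            continue  # seed points are already assigned
        c = min(range(k), key=lambda j: math.dist(p, centroids[j]))
        counts[c] += 1
        # Incremental mean update: new = old + (p - old) / n
        centroids[c] = [m + (x - m) / counts[c] for m, x in zip(centroids[c], p)]
        labels[i] = c
    return labels, centroids


# Invented two-dimensional data with two visible groups.
points = [(1.0, 1.0), (1.2, 0.8), (0.9, 1.1), (8.0, 8.0), (8.2, 7.9), (7.8, 8.1)]
labels, centroids = sequential_kmeans(points, 2)
print(labels)
print(centroids)
```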

molecular pattern recognition: Developing computational methodologies for the analysis and interpretation of large-scale expression datasets generated by DNA microarray experiments. Analysis of genome-wide expression patterns and their correlations with phenotypes of interest may provide unique insights into the structure of genetic networks and into biological processes not yet understood at the molecular level. [Whitehead/MIT [US] Genome Center's Molecular Pattern Recognition web site.] http://www.genome.wi.mit.edu/MPR/index.html Broader term pattern recognition. Related terms Expression glossary

Molecular pattern recognition links
Molecular Pattern Recognition links, Whitehead Institute, MIT, US http://www.genome.wi.mit.edu/MPR/links.html  Human and model organisms.

Molecular Pattern Recognition group projects, Michael Gribskov’s homepage, San Diego Supercomputer Center, US. http://www.sdsc.edu/~gribskov/gribskov.html

mosaic plots: A graphical alternative for qualitative, or categorical, data … display cross-classified data by constructing rectangles of area proportional to the counts … likely to become more familiar [to scientists] and their use is likely to grow. Are to categorical variables what scatterplots are to continuous variables, and their purpose is the same, to find interesting patterns of association between variables. [RD Meyer & D Book “Visualization of data” Current Opinion in Biotechnology 11:89-96, 2000]
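
Illustration: a minimal sketch of the geometry behind a mosaic plot, assuming a two-way table laid out in a unit square, with column widths set by the totals of the first variable and tile heights set by the share of the second variable within each column. The counts, the category names, and the function name mosaic_tiles are invented for this example. Only the rectangles are computed; drawing them is left to any plotting tool.

```python
def mosaic_tiles(table):
    """table maps first-variable category -> {second-variable category: count}.
    Returns (first, second, x, y, width, height) tuples inside a unit square,
    where each tile's area equals its share of the total count."""
    total = sum(sum(cells.values()) for cells in table.values())
    tiles = []
    x = 0.0
    for first, cells in table.items():
        group_total = sum(cells.values())
        width = group_total / total  # column width: share of the whole
        y = 0.0
        for second, count in cells.items():
            height = count / group_total if group_total else 0.0
            tiles.append((first, second, x, y, width, height))
            y += height
        x += width
    return tiles


# Invented cross-classified counts.
table = {
    "genotype A": {"responder": 18, "non-responder": 6},
    "genotype B": {"responder": 9, "non-responder": 27},
}
tiles = mosaic_tiles(table)
for first, second, x, y, w, h in tiles:
    print(first, second, round(x, 3), round(y, 3), round(w * h, 3))
```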

multivariate statistics: A set of statistical tools to analyze data (e.g., chemical and biological) matrices using regression and/or pattern recognition techniques. [IUPAC Computational]

Partial Least Squares PLS: Projection to latent structures (PLS) is a robust multivariate generalized regression method using projections to summarize multitudes of potentially collinear variables (Wold et al., 1993).  [IUPAC Computational]

predictive data mining: Combines pattern matching, influence relationships, time set correlations, and dissimilarity analysis to offer simulations of future data sets ... these systems are capable of incorporating entire data sets into their working, and not just samples, which make their accuracy significantly higher ... used often in clinical trial analysis and in structure-function correlations. ["Data mining" Nature Biotechnology Vol. 18: 237-238 Supp. Oct. 2000] Broader term data mining

Principal Components Analysis PCA: Computational approach to reducing the complexity of, for example, a set of descriptors, by identifying those features which provide the major contributions to observed properties, and thus reducing the dimensionality of the relevant property space. [IUPAC Combinatorial Chemistry]

A data reduction method using mathematical techniques to identify patterns in a data matrix. The main element of this approach consists of the construction of a small set of new orthogonal, i.e., non-correlated, variables derived from a linear combination of the original variables. [IUPAC Computational]

Often confused with common factor analysis. [Neural Network FAQ Part 1] ftp://ftp.sas.com/pub/neural/FAQ.html
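
Illustration: a minimal sketch of principal components analysis on a small two-variable data set, assuming the components are taken as eigenvectors of the sample covariance matrix and found by power iteration with deflation. This is one textbook route and not necessarily what any particular package does. The data values and the function names are invented for this example.

```python
def covariance_matrix(rows):
    """Return the sample covariance matrix and the mean-centered rows."""
    n, d = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(d)]
    centered = [[r[j] - means[j] for j in range(d)] for r in rows]
    cov = [
        [sum(c[i] * c[j] for c in centered) / (n - 1) for j in range(d)]
        for i in range(d)
    ]
    return cov, centered


def leading_eigenpair(matrix, iterations=500):
    """Power iteration: repeatedly multiply a vector by the matrix and rescale."""
    d = len(matrix)
    v = [1.0] + [0.5] * (d - 1)  # arbitrary non-zero starting vector
    for _ in range(iterations):
        w = [sum(matrix[i][j] * v[j] for j in range(d)) for i in range(d)]
        norm = sum(x * x for x in w) ** 0.5
        if norm == 0.0:
            break
        v = [x / norm for x in w]
    mv = [sum(matrix[i][j] * v[j] for j in range(d)) for i in range(d)]
    return sum(a * b for a, b in zip(v, mv)), v


def principal_components(rows, n_components):
    """Return [(variance, direction), ...] and the projected scores."""
    cov, centered = covariance_matrix(rows)
    d = len(cov)
    components = []
    for _ in range(n_components):
        value, vector = leading_eigenpair(cov)
        components.append((value, vector))
        # Deflation: subtract the component just found before the next search.
        cov = [
            [cov[i][j] - value * vector[i] * vector[j] for j in range(d)]
            for i in range(d)
        ]
    scores = [
        [sum(c[j] * vec[j] for j in range(d)) for _, vec in components]
        for c in centered
    ]
    return components, scores


# Invented measurements of two correlated variables.
data = [
    (2.5, 2.4), (0.5, 0.7), (2.2, 2.9), (1.9, 2.2), (3.1, 3.0),
    (2.3, 2.7), (2.0, 1.6), (1.0, 1.1), (1.5, 1.6), (1.1, 0.9),
]
components, scores = principal_components(data, 2)
total_variance = sum(value for value, _ in components)
for value, vector in components:
    print(round(value / total_variance, 3), [round(x, 3) for x in vector])
```

Each printed line gives the share of total variance carried by one new variable and the weights of the original variables in it. Keeping only the first column of scores reduces the two original variables to one.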

recursive partitioning: Process for identifying complex structure-activity relationships in large sets by dividing compounds into a hierarchy of smaller and more homogeneous sub-groups on the basis of the statistically most significant descriptors. Related terms clustering, principal components analysis. [IUPAC Combinatorial Chemistry]
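
Illustration: a minimal sketch of recursive partitioning on a handful of hypothetical compounds. Reduction in the within-group sum of squares of activity is used here as a simple stand-in for the "statistically most significant" descriptor; a real implementation would typically apply a formal significance test. The descriptors (a logP-like value and a molecular-weight-like value), the activities, the min_gain cutoff, and the function names are all invented for this example.

```python
def sum_sq(values):
    """Sum of squared deviations from the mean (0.0 for an empty list)."""
    if not values:
        return 0.0
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values)


def best_split(rows, activities):
    """Return (gain, descriptor_index, threshold) of the best binary split,
    or (0.0, None, None) if no split improves homogeneity."""
    parent = sum_sq(activities)
    best = (0.0, None, None)
    for j in range(len(rows[0])):
        values = sorted(set(r[j] for r in rows))
        for lo, hi in zip(values, values[1:]):
            t = (lo + hi) / 2  # candidate threshold between adjacent values
            left = [a for r, a in zip(rows, activities) if r[j] <= t]
            right = [a for r, a in zip(rows, activities) if r[j] > t]
            gain = parent - sum_sq(left) - sum_sq(right)
            if gain > best[0]:
                best = (gain, j, t)
    return best


def partition(rows, activities, min_gain=0.5):
    """Split recursively until no split gains at least min_gain."""
    gain, j, t = best_split(rows, activities)
    if j is None or gain < min_gain:
        return {"n": len(rows), "mean": sum(activities) / len(activities)}
    left = [i for i, r in enumerate(rows) if r[j] <= t]
    right = [i for i, r in enumerate(rows) if r[j] > t]
    return {
        "descriptor": j,
        "threshold": t,
        "left": partition([rows[i] for i in left],
                          [activities[i] for i in left], min_gain),
        "right": partition([rows[i] for i in right],
                           [activities[i] for i in right], min_gain),
    }


# Invented compounds: (logP-like value, molecular-weight-like value).
rows = [(1.0, 300), (1.5, 450), (2.0, 320), (2.5, 410),
        (3.5, 310), (4.0, 440), (4.5, 330), (5.0, 420)]
activities = [1.0, 1.1, 0.9, 1.0, 5.0, 5.2, 4.9, 5.1]

tree = partition(rows, activities)
print(tree)
```

In this invented data set the activity depends only on the first descriptor, so the sketch splits once on it and leaves two homogeneous sub-groups.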

SIMCA (SIMple Classification Analysis or Soft Independent Modeling of Class Analogy): This method is a pattern recognition and classification technique (Dunn and Wold, 1995). [IUPAC Computational]

time delay data mining: The data is collected over time and systems are designed to look for patterns that are confirmed or rejected as the data set increases and becomes more robust. This approach is geared toward long-term clinical trial analysis and multicomponent mode of action studies. ["Data mining" Nature Biotechnology Vol. 18: 237-238 Supp. Oct. 2000] Broader term data mining

trends-based data mining: Software analyzes large and complex data sets in terms of any changes that occur in specific data sets over time. Data sets can be user-defined or the system can uncover them itself ... This is especially important in cause-and-effect biological experiments. Screening is a good example. ["Data mining" Nature Biotechnology Vol. 18: 237-238 Supp. Oct. 2000] Broader term data mining

 


Cambridge Healthtech Institute
1037 Chestnut Street
Newton Upper Falls, MA 02464
Phone: 617-630-1300
Fax: 617-630-1325
Email: chi@healthtech.com