Practical Data Mining




Below is a Glossary of Data Mining Terms


Adaptive Logic Network (ALN):  A powerful, trainable, piecewise-linear regression function.


Basic Analysis (e.g., unnormalized roll-ups):  analysis methods relying on simple aggregation (collecting, counting and sorting) of unprocessed data: low-end OLAP.
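
As a rough illustration of a roll-up, the sketch below counts and sorts raw records by one field using only the Python standard library. The records, the field name `region`, and the helper `roll_up` are invented for this example.

```python
from collections import Counter

def roll_up(records, key):
    """Count records per value of `key`, sorted by descending count."""
    counts = Counter(record[key] for record in records)
    return counts.most_common()

# Hypothetical, unprocessed transaction records (illustrative only).
records = [
    {"region": "east", "amount": 120},
    {"region": "west", "amount": 75},
    {"region": "east", "amount": 40},
]

summary = roll_up(records, "region")
```

No normalization or modeling is involved; the result is simply a sorted tally of the raw data.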


Best-in-Class Tools vs. Enterprise Suites:  Enterprise suites are usually easier to use  (since they have a single, integrated operational paradigm) but will generally not be optimized in all functions.  Using the Best-in-Class for each separate function provides optimal function-by-function performance, but sacrifices consistency, functional interoperability, and ease of use.


Black Box:  Not having insight into the workings of the system.  Only concerned with input and output and the relationship between them.


Bulk:  Data size, rates and complexity.


Concept:  formally, a relation on a set of attributes.  Intuitively, a thing or idea described in terms of its attributes (e.g., a competent person, a high-speed data source, high-quality information).


Concept representation:  as a noun, the formal scheme used to depict the attributes of a concept.  As a verb, the process of defining and instantiating such a scheme.


Correlation Tools: Tools that provide a measure of relation between two variables.
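
One common measure of relation between two variables is the Pearson correlation coefficient. The sketch below is a minimal, standard-library implementation; the function name `pearson` is an assumption of this example, and other measures (rank correlation, for instance) may suit a given tool better.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    # Sums of cross-products and squared deviations from the means.
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant variable")
    return cov / math.sqrt(var_x * var_y)
```

Values near +1 or -1 indicate a strong linear relation; values near 0 indicate little linear relation.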


DBMS:  Database Management System.


Data Management:  The management of the data being mined.  This includes data collection, preparation, evaluating data quality and relevance, and data classification.


Data Mining (DM):  The detection, characterization, and exploitation of actionable patterns in data.  Data mining has two components:  Knowledge Discovery and Predictive Modeling.


DM Program Management:  management (cost, schedule, performance) of the DM process.  The empirical, experimental nature of DM as a rapid prototyping effort necessitates the use of special management techniques.


DM as rapid prototyping:  data mining, as an empirical search for hidden (latent) patterns, cannot be completely planned in advance.  Therefore, it is usually conducted under a rapid prototyping (“spiral”) methodology, allowing goals and methods to be adjusted as discoveries are made.


DM Standards:  DM is essentially the application of the scientific method to data analysis; it cannot be done haphazardly.  Several methodologies are in use, SEMMA and CRISP-DM being predominant in the industry.  A markup language for predictive modeling, PMML, is currently under development by a committee of industry practitioners.


Data Preparation: the process of conditioning data for analysis; includes normalization, registration, error detection and correction, gap filling, coding, quantization, and formatting.


Data Quality:  General term referring to the readiness of data for processing.  Data is of higher quality when it is representative of the domain, contains few gaps and outliers, and offers easy access to relevant actionable information.


Data Representation:  Data types, formats and schemas


Decision Trees: models that separate data into subsets using sets of rules, where each subset is likely to have a different effect on a target variable.


Demographic and Behavioral Data:  data about entities that exhibit behaviors, such as persons, companies, governments, etc.  Demographic data describes what an entity is (its attributes), while behavioral data describes what an entity does (actions, motivations, history).


Distributed data and information:  data required for analysis is often not available from a single source: it is distributed.  Once data has been collected, this problem is encountered again with information: information is often only found when many data items are brought together in the proper combination.


Enterprise Intelligence Tool Suite:  an integrated or interoperable collection of information analysis tools designed to be used according to a consistent methodology to solve enterprise problems.


Features:  Symbolic representation of attributes of a member of a population (weight in pounds, revenue in dollars, gender as M/F, etc.).


Feature Set Operations: operations performed on feature data, such as normalization, rounding, coding, etc.
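
Two of these operations, normalization and coding, can be sketched as follows. Min-max scaling and integer coding are just one choice of each; the function names are assumptions of this example.

```python
def min_max_normalize(values):
    """Rescale numeric values linearly onto the interval [0, 1]."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # A constant feature carries no spread; map every value to 0.0.
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]

def code_categories(values):
    """Replace each distinct symbol with an integer code, in order of first appearance."""
    codes = {}
    for v in values:
        codes.setdefault(v, len(codes))
    return [codes[v] for v in values], codes

scaled = min_max_normalize([10, 20, 30])
coded, mapping = code_categories(["M", "F", "M"])
```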


High-End Custom Applications (general non-model based regression): the use of advanced adaptive regression methods for predictive modeling (e.g., neural networks, radial basis functions, support vector machines).  These so-called “black box” methods are used when the data or the domain are not well understood, or are extremely complex.


HMI:  Human Machine Interface.  Refers to the means by which a computing system and its users interact.


Infrastructure:  The environment the data mining system will reside on.  This will include system architecture, supported languages and HMI.


Knowledge Base:  An organized collection of heuristics relevant to a particular problem domain.


Knowledge Base Expert System (KBES):  A predictive model that applies codified expert-level human knowledge in a particular problem domain according to an appropriate inference methodology.  KBES are typically built for forensic applications (diagnostics, planning, classification).  KBES are architecturally primitive and strictly segregate heuristics (their knowledge base) from the inference engine.


Knowledge Discovery (KD):  The first component of data mining.  Systematically using manual and automated tools to detect and characterize actionable patterns in data.


Meta Data:  Information about data.  This includes such facts as the number and type of data stored, where it is located, how it is organized and so on.




Model Management:  The method of managing models and results when using the models.


Model Test and Evaluation:  Testing and evaluating candidate models to determine the best single model, or the best combination of models, for addressing the problem at hand and satisfying the objective.


Neural Network (NN):  Mathematical transform whose values are computed through the cooperation of many simple transforms.  Usually a synonym for “Multi-layer Perceptron”.


Online Analytical Processing (OLAP):  Conventional data aggregation and representation for the support of (mostly manual) data mining by an analyst:  “retrieve, segment, slice, dice, drilldown, count, display, report”.


Operational Issues:  considerations that must be made when a sophisticated application is ported from the development environment, where conditions are carefully controlled, to the operational environment, where they are not.  Problems in this area often arise as a result of false assumptions made about the operational environment during development.


Predictive Modeling:  The second component of data mining:  Using the results of knowledge discovery to construct applications (models) that solve business problems.  Predictive models are generally classifiers (detect “fraud”, categorize customers, etc.) or predictors (estimate future revenue, predict “churn”, etc.).


Query and Reporting:  an OLAP function by which data satisfying a set of user-specified constraints are accessed and formatted for presentation.


Radial Basis Function (RBF):  A very powerful “kernel-based” classification paradigm.
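
For concreteness, a radial basis function is one whose value depends only on the distance between an input and a center point. The Gaussian form below is a common choice, shown here as a sketch; the name `gaussian_rbf` and the width parameter `gamma` are assumptions of this example, and a full classifier would combine many such kernels.

```python
import math

def gaussian_rbf(x, center, gamma=1.0):
    """Gaussian radial basis function: exp(-gamma * squared distance to center)."""
    if len(x) != len(center):
        raise ValueError("x and center must have the same dimension")
    squared_distance = sum((a - b) ** 2 for a, b in zip(x, center))
    return math.exp(-gamma * squared_distance)
```

The response is 1.0 at the center and decays toward 0 as the input moves away from it.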


Relevance/Independence of Features:  features are relevant when they contain information useful in addressing enterprise problems.  Features are independent when the information they contain is not present in other features.


Rule:  A relationship between facts expressed as a structured construct (e.g. IF-THEN-ELSE statement).  The rule is the fundamental unit of domain knowledge in a KBES.
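
A rule of this IF-THEN-ELSE kind can be represented as data, separate from the code that applies it. The sketch below is one hypothetical representation; the class `Rule`, its field names, and the example threshold are all invented for illustration.

```python
from dataclasses import dataclass
from typing import Any, Callable, Dict

@dataclass
class Rule:
    """IF condition(facts) THEN conclusion ELSE alternative."""
    name: str
    condition: Callable[[Dict[str, Any]], bool]
    conclusion: str
    alternative: str

    def apply(self, facts: Dict[str, Any]) -> str:
        # The rule holds the knowledge; applying it is the inference step.
        return self.conclusion if self.condition(facts) else self.alternative

# Hypothetical rule: flag large purchases for review.
large_purchase = Rule(
    name="large_purchase",
    condition=lambda facts: facts["amount"] > 1000,
    conclusion="review",
    alternative="approve",
)
```

Calling `large_purchase.apply({"amount": 5000})` evaluates the condition against the supplied facts and returns the matching branch.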


Rule Induction:  Creating rules directly from data without human assistance.


Specification Suite:  Establishing requirements and expectations.


Statistical Tools (e.g., EXCEL): analysis tools based upon sampling and statistical inferencing techniques (e.g., high-end OLAP).


Supervised Learning:  A training process that uses known ground-truth for each training vector to evaluate and improve the learning machine.


Support Vector Machine (SVM):  A powerful predictive modeling technique that creates classifiers by modeling class boundaries.


Tools for cognitive engine parameter selection: automated tools for guiding the selection of training and operational settings for cognitive engines, such as learning rates, momentum factors, termination conditions, annealing schedules, etc.


Tools/methods for application profiling (user, data):  tools for assisting the developer of cognitive engines in analyzing the problem domain and domain experts in order to quickly and accurately focus data mining efforts.  No automated tools exist, but manual processes do.


Tools and methods for model scoring and evaluation: tools for assessing the relative performance of cognitive engines.  Includes such things as lift curves, confusion matrices, ambiguity measures, visualization, statistical measures, etc.
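
Of the measures listed, the confusion matrix is the simplest to sketch: it tallies how often each actual class was assigned each predicted class. The function names and the fraud/ok labels below are assumptions of this example.

```python
from collections import Counter

def confusion_matrix(actual, predicted):
    """Count occurrences of each (actual, predicted) label pair."""
    if len(actual) != len(predicted):
        raise ValueError("actual and predicted must have the same length")
    return Counter(zip(actual, predicted))

def accuracy(matrix):
    """Fraction of cases where the predicted label equals the actual label."""
    total = sum(matrix.values())
    if total == 0:
        raise ValueError("empty confusion matrix")
    correct = sum(n for (a, p), n in matrix.items() if a == p)
    return correct / total

# Hypothetical ground truth and model output.
actual = ["fraud", "ok", "ok", "fraud"]
predicted = ["fraud", "ok", "fraud", "ok"]
matrix = confusion_matrix(actual, predicted)
```

Off-diagonal entries such as `matrix[("ok", "fraud")]` show which kinds of error a model makes, which a single accuracy figure hides.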


Tools for predictive modeling paradigm selection: automated tools for assisting the developer of cognitive engines in selecting the proper analysis and modeling paradigms (e.g., neural net vs. rule-based system).


Unsupervised Learning:  A training process that detects and characterizes previously unspecified patterns in data. 


Visualization:  Depiction of data in visual form so that quality and relationships may be observed by a human analyst.


White Box:  Having insight into the workings of the data mining system and how the outcome is produced.


