Glossary

Below is a glossary of data mining terms.

Adaptive Logic Network (ALN): A powerful, trainable, piecewise-linear regression function.

Basic Analysis (e.g., unnormalized roll-ups): Analysis methods relying on simple aggregation (collecting, counting, and sorting) of unprocessed data: low-end OLAP.

Best-in-Class Tools vs.

Black Box: A view of a system without insight into its internal workings, concerned only with inputs, outputs, and the relationship between them.

Bulk:
Data size, rates, and complexity.

Concept: Formally, a relation on a set of attributes. Intuitively, a thing or idea described in terms of its attributes (e.g., a competent person, a high-speed data source, high-quality information).

Concept Representation: As a noun, the formal scheme used to depict the attributes of a concept. As a verb, the process of defining and instantiating such a scheme.

Correlation Tools: Tools that provide a measure of the relation between two variables.

DBMS:
Database Management System.

Data Management: The management of the data being mined. This includes data collection, preparation, evaluation of data quality and relevance, and data classification.

Data Mining (DM): The detection, characterization, and exploitation of actionable patterns in data. Data mining has two components: Knowledge Discovery and Predictive Modeling.

DM Program Management: Management (cost, schedule, performance) of the DM process. The empirical, experimental nature of DM as a rapid prototyping effort necessitates the use of special management techniques.

DM as Rapid Prototyping:
Data mining, as an empirical search for hidden, latent patterns, cannot be completely planned in advance. Therefore, it is usually conducted under a rapid prototyping (“spiral”) methodology, allowing goals and methods to be adjusted as discoveries are made.

DM Standards: DM is essentially the application of the scientific method to data analysis; it cannot be done haphazardly. Several methodologies are in use, SEMMA and CRISP-DM being predominant in the industry. A markup language for predictive modeling, PMML, is currently under development by a committee of industry practitioners.

Data Preparation: The process of conditioning data for analysis; includes normalization, registration, error detection and correction, gap filling, coding, quantization, and formatting.

Data Quality: General term referring to the readiness of data for processing. Data is of higher quality when it is representative of the domain, contains few gaps and outliers, and offers easy access to relevant, actionable information.

Data Representation: Data types, formats, and schemas.

Decision Trees: Models that separate data into sets of rules that are likely to have different effects on a target variable.

Demographic and Behavioral Data: Data about entities that exhibit behaviors, such as persons, companies, and governments. Demographic data describes what an entity is (its attributes), while behavioral data describes what an entity does (actions, motivations, history).

Distributed Data and Information:
Data required for analysis is often not available from a single source: it is distributed. Once data has been collected, this problem is encountered again with information: information is often found only when many data items are brought together in the proper combination.

Features: Symbolic representations of attributes of a member of a population (weight in pounds, revenue in dollars, gender as M/F, etc.).

Feature Set Operations: Operations performed on feature data, such as normalization, rounding, and coding.

High-End Custom Applications (general non-model-based regression): The use of advanced adaptive regression methods for predictive modeling (e.g., neural networks, radial basis functions, support vector machines). These so-called “black box” methods are used when the data or the domain is not well understood or is extremely complex.

HMI: Human-Machine Interface. Refers to the means by which a computing system and its users interact.

Infrastructure: The environment in which the data mining system will reside. This includes system architecture, supported languages, and HMI.

Knowledge Base: An organized collection of heuristics relevant to a particular problem domain.

Knowledge Base Expert System (KBES): A predictive model that applies codified expert-level human knowledge in a particular problem domain according to an appropriate inference methodology. KBES are typically built for forensic applications (diagnostics, planning, classification). KBES are architecturally primitive and strictly segregate heuristics (their knowledge base) from the inference engine.

Knowledge Discovery (KD): The first component of data mining.
The systematic use of manual and automated tools to detect and characterize actionable patterns in data.

Meta-Schemes:

Model Management: The method of managing models and their results when the models are in use.

Model Test and Evaluation: Testing and evaluating candidate models to identify the best single model, or the best combination of models, for addressing the problem at hand and satisfying the objective.

Neural Network (NN): A mathematical transform whose values are computed through the cooperation of many simple transforms. Usually a synonym for “Multi-Layer Perceptron”.

Online Analytical Processing (OLAP): Conventional data aggregation and representation for the support of (mostly manual) data mining by an analyst: “retrieve, segment, slice, dice, drill down, count, display, report”.

Operational Issues:
Considerations that must be made when a sophisticated application is ported from the development environment, where conditions are carefully controlled, to the operational environment, where they are not. Problems in this area often arise as a result of false assumptions made about the operational environment during development.

Predictive Modeling: The second component of data mining. Using the results of knowledge discovery to construct applications (models) that solve business problems. Predictive models are generally classifiers (detect “fraud”, categorize customers, etc.) or predictors (estimate future revenue, predict “churn”, etc.).

Query and Reporting: An OLAP function by which data satisfying a set of user-specified constraints are accessed and formatted for presentation.

Radial Basis Function (RBF): A very powerful “kernel-based” classification paradigm.

Relevance/Independence of Features: Features are relevant when they contain information useful in addressing enterprise problems. Features are independent when the information they contain is not present in other features.

Rule:
A relationship between facts expressed as a structured construct (e.g., an IF-THEN-ELSE statement). The rule is the fundamental unit of domain knowledge in a KBES.

Rule Induction: Creating rules directly from data without human assistance.

Specification Suite: Establishing requirements and expectations.

Statistical Tools (e.g., Excel): Analysis tools based upon sampling and statistical inference techniques (e.g., high-end OLAP).

Supervised Learning: A training process that uses known ground truth for each training vector to evaluate and improve the learning machine.

Support Vector Modeling (SVM): A powerful predictive modeling technique that creates classifiers by modeling class boundaries.

Tools for cognitive engine parameter selection: Automated tools for guiding the selection of training and operational settings for cognitive engines, such as learning rates, momentum factors, termination conditions, and annealing schedules.

Tools/methods for application profiling (user, data): Tools for assisting the developer of cognitive engines in analyzing the problem domain and domain experts in order to quickly and accurately focus data mining efforts. No automated tools exist, but manual processes do.

Tools and methods for model scoring and evaluation: Tools for assessing the relative performance of cognitive engines. These include lift curves, confusion matrices, ambiguity measures, visualization, and statistical measures.

Tools for predictive modeling paradigm selection: Automated tools for assisting the developer of cognitive engines in selecting the proper analysis and modeling paradigms (e.g., neural net vs. rule-based system).

Unsupervised Learning: A training process that detects and characterizes previously unspecified patterns in data.

Visualization:
Depiction of data in visual form so that quality and relationships may be observed by a human analyst.

White Box: Having insight into the workings of the data mining system and how the outcome is produced.








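The glossary entry on tools and methods for model scoring and evaluation mentions confusion matrices and lift curves without showing how they are computed. The following is a minimal sketch for a binary classifier, not a method taken from the book. The function names `confusion_matrix` and `lift_at`, the 0.5 decision threshold, and the toy data are all hypothetical choices made for illustration.

```python
from collections import Counter


def confusion_matrix(actual, predicted):
    """Count (actual, predicted) label pairs for a binary classifier (labels 0/1)."""
    counts = Counter(zip(actual, predicted))
    return {
        "tp": counts[(1, 1)],  # positives correctly flagged
        "fn": counts[(1, 0)],  # positives missed
        "fp": counts[(0, 1)],  # negatives wrongly flagged
        "tn": counts[(0, 0)],  # negatives correctly passed
    }


def lift_at(actual, scores, fraction):
    """Lift in the top `fraction` of records ranked by model score.

    Lift is the positive rate among the top-ranked records divided by the
    positive rate in the whole population. Assumes at least one positive.
    """
    ranked = sorted(zip(scores, actual), key=lambda pair: pair[0], reverse=True)
    top_n = max(1, round(len(ranked) * fraction))
    top_rate = sum(label for _, label in ranked[:top_n]) / top_n
    base_rate = sum(actual) / len(actual)
    return top_rate / base_rate


# Hypothetical toy data: ground-truth labels and model scores for ten records.
actual = [1, 0, 1, 0, 0, 1, 0, 0, 0, 0]
scores = [0.9, 0.8, 0.7, 0.4, 0.3, 0.6, 0.2, 0.1, 0.35, 0.05]

# Assumed decision threshold of 0.5 turns scores into class predictions.
predicted = [1 if s >= 0.5 else 0 for s in scores]

cm = confusion_matrix(actual, predicted)
top_lift = lift_at(actual, scores, 0.3)

print(cm)                  # {'tp': 3, 'fn': 0, 'fp': 1, 'tn': 6}
print(round(top_lift, 2))  # 2.22
```

Evaluating `lift_at` over a range of fractions (for example 0.1, 0.2, ..., 1.0) gives the points of a lift curve. A lift of 1.0 means the model ranks records no better than random selection.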