Three stages of statistical signal processing (or machine learning, pattern classification, data mining).
Stage 1: Preprocessing. This stage usually starts with data preparation, which may involve cleaning the data, applying transformations, selecting subsets of records, and, for data sets with large numbers of variables ("fields"), performing preliminary feature selection to bring the number of variables into a manageable range (depending on the statistical methods under consideration). Then, depending on the nature of the analytic problem, this first stage may involve anything from a simple choice of straightforward predictors for a regression model to elaborate exploratory analyses using a wide variety of graphical and statistical methods (see Exploratory Data Analysis (EDA)) in order to identify the most relevant variables and determine the complexity and/or general nature of the models to be considered in the next stage.
Stage 2: Model building and validation (stochastic modeling). This stage involves considering various models and choosing the best one based on predictive performance (i.e., explaining the variability in question and producing stable results across samples). This may sound like a simple operation, but it sometimes involves a very elaborate process. A variety of techniques have been developed to achieve this goal, many based on so-called "competitive evaluation of models": applying different models to the same data set and comparing their performance to choose the best. These techniques, often considered the core of predictive data mining, include Bagging (Voting, Averaging), Boosting, Stacking (Stacked Generalizations), and Meta-Learning. The outcome of Stage 2 is a statistical processor, built in the following two steps:
- Model selection
- Parameter estimation of the model
Stage 3: Deployment. Apply the statistical processor to new data in order to generate detections, classifications, predictions or estimates of the parameters.
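To make the three stages concrete, here is a minimal sketch in Python (not from the original text; the data, model family, and train/validation split are all illustrative): a polynomial regression problem in which Stage 2 performs model selection over the polynomial degree and parameter estimation by least squares, and Stage 3 applies the chosen processor to new data.

import numpy as np

rng = np.random.default_rng(0)

# Stage 1: Preprocessing -- prepare data and split off a validation set.
x = rng.uniform(-1, 1, 200)
y = 1.5 * x - 0.8 * x**2 + rng.normal(0, 0.1, 200)   # unknown "truth" plus noise
x_train, y_train = x[:150], y[:150]
x_val, y_val = x[150:], y[150:]

# Stage 2: Model building -- model selection (polynomial degree) plus
# parameter estimation (least-squares fit), scored on held-out data.
best_deg, best_err, best_coef = None, np.inf, None
for deg in range(1, 6):
    coef = np.polyfit(x_train, y_train, deg)            # parameter estimation
    err = np.mean((np.polyval(coef, x_val) - y_val) ** 2)
    if err < best_err:
        best_deg, best_err, best_coef = deg, err, coef  # model selection

# Stage 3: Deployment -- apply the fitted statistical processor to new data.
x_new = rng.uniform(-1, 1, 5)
print(best_deg, np.polyval(best_coef, x_new))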
A measure of model complexity: VC dimension
A correlation coefficient is a number between -1 and 1 that measures the degree to which two variables are linearly related. If there is a perfect linear relationship with positive slope between the two variables, i.e., Y = aX with real a > 0, the correlation coefficient is 1; with positive correlation, whenever one variable has a high (low) value, so does the other. If there is a perfect linear relationship with negative slope, i.e., Y = aX with real a < 0, the correlation coefficient is -1; with negative correlation, whenever one variable has a high (low) value, the other has a low (high) value. A correlation coefficient of 0 means the variables are uncorrelated.
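A quick numerical illustration of these three cases, assuming NumPy is available (the data are synthetic):

import numpy as np

rng = np.random.default_rng(1)
x = rng.normal(size=1000)

# Perfect positive linear relation (Y = a*X, a > 0) gives r = 1;
# a negative slope gives r = -1; independent noise gives r near 0.
print(np.corrcoef(x, 2.0 * x)[0, 1])                  # 1.0
print(np.corrcoef(x, -3.0 * x)[0, 1])                 # -1.0
print(np.corrcoef(x, rng.normal(size=1000))[0, 1])    # approximately 0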
Hypothesis tests for goodness-of-fit (making a binary decision: good or bad fit)
For regression, goodness-of-fit tests are of two types: tests for underfitting and tests for overfitting. Refer to:
Y. Bar-Shalom, X. R. Li and T. Kirubarajan, Estimation with Applications to Tracking and Navigation: Algorithms and Software for Information Extraction, J. Wiley and Sons, 2001, p. 154.
Gillick, L. and Cox, S. J., Some statistical issues in the comparison of speech recognition algorithms, Proc. ICASSP-89, Glasgow, pp. 532-535, June 1989. (Tests the statistical significance of parameter estimation in speech recognition.)
ANOVA (analysis of variance) is used for change detection in controlled experiments. The change detection is formulated as a hypothesis test, i.e., testing whether several samples have the same mean. Specifically, ANOVA tests the ratio of the variance between samples/groups to the variance within a sample/group (which is an F test).
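A minimal sketch of this F test, assuming SciPy is available (the three groups below are made-up data; the third has a shifted mean, i.e., a "change"):

import numpy as np
from scipy import stats

rng = np.random.default_rng(2)
group_a = rng.normal(0.0, 1.0, 30)   # baseline condition
group_b = rng.normal(0.0, 1.0, 30)   # no change
group_c = rng.normal(1.0, 1.0, 30)   # shifted mean (a "change")

# F = between-group variance / within-group variance; a small p-value
# rejects the hypothesis that all groups share the same mean.
f_stat, p_value = stats.f_oneway(group_a, group_b, group_c)
print(f_stat, p_value)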
For PDF/PMF matching (modeling), goodness-of-fit tests include the chi-square test (for PMFs/binned data) and the Kolmogorov-Smirnov test (for PDFs).
We can also use hypothesis testing to compare the performance of different algorithms (i.e., whether one algorithm is better than another in a statistically significant sense). Refer to:
Y. Bar-Shalom, X. R. Li and T. Kirubarajan, Estimation with Applications to Tracking and Navigation: Algorithms and Software for Information Extraction, J. Wiley and Sons, 2001, p. 80.
One-sided/two-sided test
Experiment design: Design a measurement device to collect data for some statistical test. Use data preprocessing to reduce possible artifacts. Choose a hypothesis test to select ONE best model among all the candidate models that could plausibly explain the data.
Discriminant analysis is a technique for classifying a set of observations into predefined classes. The purpose is to determine the class of an observation based on a set of variables known as predictors or input variables. The model is built from a set of observations whose classes are known, sometimes referred to as the training set. From the training set, the technique constructs a set of linear functions of the predictors, known as discriminant functions, of the form L = b_1 x_1 + b_2 x_2 + ... + b_n x_n + c, where the b's are discriminant coefficients, the x's are the input variables or predictors, and c is a constant.
These discriminant functions are used to predict the class of a new observation of unknown class. For a k-class problem, k discriminant functions are constructed. Given a new observation, all k discriminant functions are evaluated and the observation is assigned to class i if the i-th discriminant function has the highest value.
i^* = \arg\max_i f_i(\vec{x}, \vec{\alpha}), where f_i(\vec{x}, \vec{\alpha}) is the discriminant function of the i-th class.
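A minimal sketch of this decision rule for linear discriminant functions, with made-up coefficients for a hypothetical 3-class, 2-predictor problem:

import numpy as np

B = np.array([[ 1.0, -0.5],     # b coefficients, one row per class
              [-0.2,  0.8],
              [ 0.5,  0.5]])
c = np.array([0.1, -0.3, 0.0])  # constants, one per class

def classify(x):
    scores = B @ x + c             # evaluate all k discriminant functions
    return int(np.argmax(scores))  # assign to the class with the highest value

print(classify(np.array([2.0, 1.0])))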
http://en.wikipedia.org/wiki/Regression_analysis
Parameter estimation:
Data Mining
Data Mining is an analytic process designed to explore data (usually large amounts of data, typically business or market related) in search of consistent patterns and/or systematic relationships between variables, and then to validate the findings by applying the detected patterns to new subsets of data. The ultimate goal of data mining is prediction, and predictive data mining is the most common type of data mining and the one with the most direct business applications. The process of data mining consists of three stages: (1) the initial exploration, (2) model building or pattern identification with validation/verification, and (3) deployment (i.e., the application of the model to new data in order to generate predictions).
Stage 1: Exploration. This stage usually starts with data preparation, which may involve cleaning the data, applying transformations, selecting subsets of records, and, for data sets with large numbers of variables ("fields"), performing preliminary feature selection to bring the number of variables into a manageable range (depending on the statistical methods under consideration). Then, depending on the nature of the analytic problem, this first stage may involve anything from a simple choice of straightforward predictors for a regression model to elaborate exploratory analyses using a wide variety of graphical and statistical methods (see Exploratory Data Analysis (EDA)) in order to identify the most relevant variables and determine the complexity and/or general nature of the models to be considered in the next stage.
Stage 2: Model building and validation. This stage involves considering various models and choosing the best one based on predictive performance (i.e., explaining the variability in question and producing stable results across samples). This may sound like a simple operation, but it sometimes involves a very elaborate process. A variety of techniques have been developed to achieve this goal, many based on so-called "competitive evaluation of models": applying different models to the same data set and comparing their performance to choose the best. These techniques, often considered the core of predictive data mining, include Bagging (Voting, Averaging), Boosting, Stacking (Stacked Generalizations), and Meta-Learning.
Stage 3: Deployment. This final stage involves using the model selected as best in the previous stage and applying it to new data in order to generate predictions or estimates of the expected outcome.
The concept of Data Mining is becoming increasingly popular as a business information management tool, where it is expected to reveal knowledge structures that can guide decisions under conditions of limited certainty. Recently, there has been increased interest in developing new analytic techniques specifically designed to address the issues relevant to business Data Mining (e.g., Classification Trees), but Data Mining is still based on the conceptual principles of statistics, including traditional Exploratory Data Analysis (EDA) and modeling, with which it shares both general approaches and specific techniques.
However, an important general difference in focus and purpose between Data Mining and traditional Exploratory Data Analysis (EDA) is that Data Mining is more oriented towards applications than towards the basic nature of the underlying phenomena. In other words, Data Mining is relatively less concerned with identifying the specific relations between the variables involved. For example, uncovering the nature of the underlying functions or the specific types of interactive, multivariate dependencies between variables is not the main goal of Data Mining. Instead, the focus is on producing a solution that can generate useful predictions. Therefore, Data Mining accepts, among others, a "black box" approach to data exploration or knowledge discovery, and uses not only traditional Exploratory Data Analysis (EDA) techniques but also techniques such as Neural Networks, which can generate valid predictions but are not capable of identifying the specific nature of the interrelations between the variables on which the predictions are based.
Data Mining is often considered to be "a blend of statistics, AI [artificial intelligence], and data base research" (Pregibon, 1997, p. 8), which until very recently was not commonly recognized as a field of interest for statisticians, and was even considered by some "a dirty word in Statistics" (Pregibon, 1997, p. 8). Due to its applied importance, however, the field emerges as a rapidly growing and major area (also in statistics) where important theoretical advances are being made (see, for example, the recent annual International Conferences on Knowledge Discovery and Data Mining, co-hosted by the American Statistical Association).
There are numerous books that review the theory and practice of data mining; the following offer a representative sample of recent general books on data mining, representing a variety of approaches and perspectives:
Berry, M. J. A., & Linoff, G. S. (2000). Mastering data mining. New York: Wiley.
Edelstein, H. A. (1999). Introduction to data mining and knowledge discovery (3rd ed.). Potomac, MD: Two Crows Corp.
Fayyad, U. M., Piatetsky-Shapiro, G., Smyth, P., & Uthurusamy, R. (1996). Advances in knowledge discovery and data mining. Cambridge, MA: MIT Press.
Han, J., & Kamber, M. (2000). Data mining: Concepts and techniques. New York: Morgan Kaufmann.
Hastie, T., Tibshirani, R., & Friedman, J. H. (2001). The elements of statistical learning: Data mining, inference, and prediction. New York: Springer.
Pregibon, D. (1997). Data mining. Statistical Computing and Graphics, 7, 8.
Weiss, S. M., & Indurkhya, N. (1997). Predictive data mining: A practical guide. New York: Morgan Kaufmann.
Westphal, C., & Blaxton, T. (1998). Data mining solutions. New York: Wiley.
Witten, I. H., & Frank, E. (2000). Data mining. New York: Morgan Kaufmann.
Crucial Concepts in Data Mining
Bagging (Voting, Averaging)
The concept of bagging (voting for classification, averaging for regression-type problems with continuous dependent variables of interest) applies to the area of predictive data mining; it combines the predicted classifications (predictions) from multiple models, or from the same type of model fit to different learning data. It is also used to address the inherent instability of results when applying complex models to relatively small data sets. Suppose your data mining task is to build a model for predictive classification, and the data set from which to train the model (the learning data set, which contains observed classifications) is relatively small. You could repeatedly sub-sample (with replacement) from the data set and apply, for example, a tree classifier (e.g., C&RT or CHAID) to the successive samples. In practice, very different trees will often be grown for the different samples, illustrating the instability of models often evident with small data sets. One method of deriving a single prediction (for new observations) is to use all trees found in the different samples and apply simple voting: the final classification is the one most often predicted by the different trees. Note that some weighted combination of predictions (weighted vote, weighted average) is also possible and commonly used. A sophisticated (machine learning) algorithm for generating weights for weighted prediction or voting is the Boosting procedure.
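A minimal bagging sketch along these lines, assuming scikit-learn for the tree classifier (the data and ensemble size are illustrative, and a shallow decision tree stands in for C&RT/CHAID):

import numpy as np
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(3)
X = rng.normal(size=(100, 4))
y = (X[:, 0] + X[:, 1] > 0).astype(int)    # synthetic labels

trees = []
for _ in range(25):
    idx = rng.integers(0, len(X), len(X))  # bootstrap sample (with replacement)
    trees.append(DecisionTreeClassifier(max_depth=3).fit(X[idx], y[idx]))

X_new = rng.normal(size=(5, 4))
votes = np.array([t.predict(X_new) for t in trees])  # one row per tree
final = (votes.mean(axis=0) > 0.5).astype(int)       # simple majority vote
print(final)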
Boosting
The concept of boosting applies to the area of predictive data mining, to generate multiple models or classifiers (for prediction or classification), and to derive weights to combine the predictions from those models into a single prediction or predicted classification (see also Bagging).
A simple algorithm for boosting works like this: Start by applying some method (e.g., a tree classifier such as C&RT or CHAID) to the learning data, where each observation is assigned an equal weight. Compute the predicted classifications, and apply weights to the observations in the learning sample that are inversely proportional to the accuracy of the classification. In other words, assign greater weight to those observations that were difficult to classify (where the misclassification rate was high), and lower weights to those that were easy to classify (where the misclassification rate was low). In the context of C&RT for example, different misclassification costs (for the different classes) can be applied, inversely proportional to the accuracy of prediction in each class. Then apply the classifier again to the weighted data (or with different misclassification costs), and continue with the next iteration (application of the analysis method for classification to the re-weighted data).
Boosting will generate a sequence of classifiers, where each consecutive classifier in the sequence is an "expert" in classifying observations that were not well classified by those preceding it. During deployment (for prediction or classification of new cases), the predictions from the different classifiers can then be combined (e.g., via voting, or some weighted voting procedure) to derive a single best prediction or classification.
Note that boosting can also be applied to learning methods that do not explicitly support weights or misclassification costs. In that case, random sub-sampling can be applied to the learning data in the successive steps of the iterative boosting procedure, where the probability for selection of an observation into the subsample is inversely proportional to the accuracy of the prediction for that observation in the previous iteration (in the sequence of iterations of the boosting procedure).
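One well-known concrete instantiation of the reweighting scheme described above is AdaBoost; the following is a minimal sketch under that interpretation, assuming scikit-learn and using depth-1 trees ("stumps") as the weak classifier (labels coded as -1/+1; all constants are illustrative):

import numpy as np
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(4)
X = rng.normal(size=(200, 2))
y = np.where(X[:, 0] * X[:, 1] > 0, 1, -1)

w = np.full(len(X), 1.0 / len(X))   # start with equal observation weights
stumps, alphas = [], []
for _ in range(20):
    stump = DecisionTreeClassifier(max_depth=1).fit(X, y, sample_weight=w)
    pred = stump.predict(X)
    err = np.sum(w[pred != y]) / np.sum(w)            # weighted error rate
    alpha = 0.5 * np.log((1 - err) / (err + 1e-10))   # weight of this classifier
    w *= np.exp(-alpha * y * pred)                    # up-weight misclassified cases
    w /= w.sum()
    stumps.append(stump)
    alphas.append(alpha)

# Combined (weighted-vote) prediction from the sequence of classifiers:
agg = sum(a * s.predict(X) for a, s in zip(alphas, stumps))
print(np.mean(np.sign(agg) == y))   # training accuracy of the ensemble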
CRISP
See Models for Data Mining.
Data Preparation (in Data Mining)
Data preparation and cleaning is an often neglected but extremely important step in the data mining process. The old saying "garbage in, garbage out" is particularly applicable to typical data mining projects, where large data sets collected via some automatic method (e.g., via the Web) serve as the input to the analyses. Often, the method by which the data were gathered was not tightly controlled, and so the data may contain out-of-range values (e.g., Income: -100), impossible data combinations (e.g., Gender: Male, Pregnant: Yes), and the like. Analyzing data that have not been carefully screened for such problems can produce highly misleading results, particularly in predictive data mining.
Data Reduction (for Data Mining)
The term Data Reduction in the context of data mining is usually applied to projects where the goal is to aggregate or amalgamate the information contained in large data sets into manageable (smaller) information nuggets. Data reduction methods can include simple tabulation, aggregation (computing descriptive statistics), or more sophisticated techniques like clustering, principal components analysis, etc.
See also predictive data mining, drill-down analysis.
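As a sketch of the idea, principal components analysis can compress many correlated variables into a few components (scikit-learn assumed; the data are synthetic, generated from 3 underlying factors):

import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(5)
latent = rng.normal(size=(500, 3))               # 3 underlying factors
X = latent @ rng.normal(size=(3, 20)) + 0.05 * rng.normal(size=(500, 20))

pca = PCA(n_components=3)
X_reduced = pca.fit_transform(X)                 # 20 variables -> 3 components
print(X_reduced.shape, pca.explained_variance_ratio_)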
Deployment
The concept of deployment in predictive data mining refers to the application of a model for prediction or classification to new data. After a satisfactory model or set of models has been identified (trained) for a particular application, one usually wants to deploy those models so that predictions or predicted classifications can quickly be obtained for new data. For example, a credit card company may want to deploy a trained model or set of models (e.g., neural networks, meta-learner) to quickly identify transactions which have a high probability of being fraudulent.
Drill-Down Analysis
The concept of drill-down analysis applies to the area of data mining, to denote the interactive exploration of data, in particular of large databases. The process of drill-down analysis begins by considering some simple break-downs of the data by a few variables of interest (e.g., gender, geographic region, etc.). Various statistics, tables, histograms, and other graphical summaries can be computed for each group. Next one may want to "drill down" to expose and further analyze the data "underneath" one of the categorizations; for example, one might want to further review the data for males from the Midwest. Again, various statistical and graphical summaries can be computed for those cases only, which might suggest further break-downs by other variables (e.g., income, age, etc.). At the lowest ("bottom") level are the raw data: for example, you may want to review the addresses of male customers from one region, for a certain income group, etc., and to offer to those customers some particular services of particular utility to that group.
Feature Selection
One of the preliminary stages in predictive data mining, when the data set includes more variables than could be included (or would be efficient to include) in the actual model-building phase (or even in initial exploratory operations), is to select predictors from a large list of candidates. For example, when data are collected via automated (computerized) methods, it is not uncommon for measurements to be recorded for thousands or hundreds of thousands (or more) of predictors. The standard analytic methods for predictive data mining, such as neural network analyses, classification and regression trees, generalized linear models, or general linear models, become impractical when the number of predictors exceeds more than a few hundred variables.
Feature selection selects a subset of predictors from a large list of candidate predictors without assuming that the relationships between the predictors and the dependent or outcome variables of interest are linear, or even monotone. Therefore, this is used as a pre-processor for predictive data mining, to select manageable sets of predictors that are likely related to the dependent (outcome) variables of interest, for further analyses with any of the other methods for regression and classification.
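A minimal filter-style sketch of this idea, using mutual information as the association measure (it does not require a linear or even monotone relationship, matching the point above); scikit-learn is assumed and the data are synthetic:

import numpy as np
from sklearn.feature_selection import mutual_info_regression

rng = np.random.default_rng(6)
X = rng.normal(size=(300, 50))                  # 50 candidate predictors
y = np.sin(X[:, 3]) + X[:, 17] ** 2 + 0.1 * rng.normal(size=300)

scores = mutual_info_regression(X, y)           # univariate relevance scores
top_k = np.argsort(scores)[-5:]                 # keep the 5 best predictors
print(sorted(top_k))                            # should include 3 and 17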
Machine Learning
Machine learning, computational learning theory, and similar terms are often used in the context of Data Mining to denote the application of generic model-fitting or classification algorithms for predictive data mining. Unlike traditional statistical data analysis, which is usually concerned with the estimation of population parameters by statistical inference, the emphasis in data mining (and machine learning) is usually on the accuracy of prediction (predicted classification), regardless of whether the "models" or techniques used to generate the predictions are interpretable or open to simple explanation. Good examples of this type of technique, often applied to predictive data mining, are neural networks and meta-learning techniques such as boosting. These methods usually involve fitting very complex "generic" models that are not related to any reasoning or theoretical understanding of underlying causal processes; instead, these techniques can be shown to generate accurate predictions or classifications in crossvalidation samples.
Meta-Learning
The concept of meta-learning applies to the area of predictive data mining, to combine the predictions from multiple models. It is particularly useful when the types of models included in the project are very different. In this context, this procedure is also referred to as Stacking (Stacked Generalization).
Suppose your data mining project includes tree classifiers, such as C&RT and CHAID, linear discriminant analysis (e.g., see GDA), and Neural Networks. Each computes predicted classifications for a crossvalidation sample, from which overall goodness-of-fit statistics (e.g., misclassification rates) can be computed. Experience has shown that combining the predictions from multiple methods often yields more accurate predictions than can be derived from any one method (e.g., see Witten and Frank, 2000). The predictions from different classifiers can be used as input into a meta-learner, which will attempt to combine the predictions to create a final best predicted classification. So, for example, the predicted classifications from the tree classifiers, linear model, and the neural network classifier(s) can be used as input variables into a neural network meta-classifier, which will attempt to "learn" from the data how to combine the predictions from the different models to yield maximum classification accuracy.
One can apply meta-learners to the results from different meta-learners to create "meta-meta"-learners, and so on; however, in practice such exponential increase in the amount of data processing, in order to derive an accurate prediction, will yield less and less marginal utility.
Models for Data Mining
In the business environment, complex data mining projects may require the coordinated efforts of various experts, stakeholders, or departments throughout an entire organization. In the data mining literature, various "general frameworks" have been proposed to serve as blueprints for how to organize the process of gathering data, analyzing data, disseminating results, implementing results, and monitoring improvements.
One such model, CRISP (Cross-Industry Standard Process for data mining), was proposed in the mid-1990s by a European consortium of companies to serve as a non-proprietary standard process model for data mining. This general approach postulates the following (perhaps not particularly controversial) sequence of steps for data mining projects: business understanding, data understanding, data preparation, modeling, evaluation, and deployment.
Another approach, the Six Sigma methodology, is a well-structured, data-driven methodology for eliminating defects, waste, or quality-control problems of all kinds in manufacturing, service delivery, management, and other business activities. This model has recently become very popular (due to its successful implementations) in various American industries, and it appears to be gaining favor worldwide. It postulates a sequence of so-called DMAIC steps (Define, Measure, Analyze, Improve, Control) that grew out of the manufacturing, quality improvement, and process control traditions and is particularly well suited to production environments (including "production of services," i.e., service industries).
Another framework of this kind (actually somewhat similar to Six Sigma) is the approach proposed by the SAS Institute, called SEMMA (Sample, Explore, Modify, Model, Assess), which focuses more on the technical activities typically involved in a data mining project.
All of these models are concerned with the process of how to integrate data mining methodology into an organization, how to "convert data into information," how to involve important stakeholders, and how to disseminate the information in a form that can easily be converted by stakeholders into resources for strategic decision making.
Some software tools for data mining are specifically designed and documented to fit into one of these specific frameworks.
Predictive Data Mining
The term Predictive Data Mining is usually applied to data mining projects whose goal is to identify a statistical or neural network model or set of models that can be used to predict some response of interest. For example, a credit card company may want to engage in predictive data mining to derive a (trained) model or set of models (e.g., neural networks, meta-learner) that can quickly identify transactions which have a high probability of being fraudulent. Other types of data mining projects may be more exploratory in nature (e.g., to identify clusters or segments of customers), in which case drill-down descriptive and exploratory methods would be applied. Data reduction is another possible objective for data mining (e.g., to aggregate or amalgamate the information in very large data sets into useful and manageable chunks).
SEMMA
See Models for Data Mining.
Stacked Generalization
See Stacking.
Stacking (Stacked Generalization)
The concept of stacking (short for Stacked Generalization) applies to the area of predictive data mining, to combine the predictions from multiple models. It is particularly useful when the types of models included in the project are very different.
Suppose your data mining project includes tree classifiers, such as C&RT or CHAID, linear discriminant analysis (e.g., see GDA), and Neural Networks. Each computes predicted classifications for a crossvalidation sample, from which overall goodness-of-fit statistics (e.g., misclassification rates) can be computed. Experience has shown that combining the predictions from multiple methods often yields more accurate predictions than can be derived from any one method (e.g., see Witten and Frank, 2000). In stacking, the predictions from different classifiers are used as input into a meta-learner, which attempts to combine the predictions to create a final best predicted classification. So, for example, the predicted classifications from the tree classifiers, linear model, and the neural network classifier(s) can be used as input variables into a neural network meta-classifier, which will attempt to "learn" from the data how to combine the predictions from the different models to yield maximum classification accuracy.
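A minimal stacking sketch along these lines, assuming scikit-learn (a decision tree stands in for C&RT/CHAID, linear discriminant analysis for GDA, and a logistic regression serves as the meta-learner; out-of-fold predictions are used so the meta-learner does not see each base model's own training fits):

import numpy as np
from sklearn.datasets import make_classification
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_predict
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=400, n_features=10, random_state=0)

base_models = [DecisionTreeClassifier(max_depth=4),
               LinearDiscriminantAnalysis()]

# Out-of-fold predicted class probabilities from each base classifier
# become the input variables of the meta-learner.
meta_features = np.column_stack([
    cross_val_predict(m, X, y, cv=5, method="predict_proba")[:, 1]
    for m in base_models])

meta = LogisticRegression().fit(meta_features, y)   # the meta-learner
print(meta.score(meta_features, y))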
Other methods for combining the prediction from multiple models or methods (e.g., from multiple datasets used for learning) are Boosting and Bagging (Voting).
Text Mining
While Data Mining is typically concerned with the detection of patterns in numeric data, very often important (e.g., business-critical) information is stored in the form of text. Unlike numeric data, text is often amorphous and difficult to deal with. Text mining generally consists of the analysis of (multiple) text documents by extracting key phrases, concepts, etc., and the preparation of the text processed in that manner for further analyses with numeric data mining techniques (e.g., to determine co-occurrences of concepts, key phrases, names, addresses, product names, etc.).
Suppose that (X_1, X_2, ..., X_n) is a random sample from the normal distribution with mean µ and standard deviation σ. In this section, we will establish several special properties of the sample mean, the sample variance, and some other important statistics.
First recall that the sample mean is
M = (1/n) \sum_{i=1}^{n} X_i.
The distribution of M follows easily from basic properties of independent normal variables:
1. Show that M is normally distributed with mean µ and variance σ^2 / n.
2. Show that Z = (M - µ) / (σ / \sqrt{n}) has the standard normal distribution.
The standard score Z will appear in several of the derivations below.
Recall that if µ is known, a natural estimator of the variance σ^2 is
W^2 = (1/n) \sum_{i=1}^{n} (X_i - µ)^2.
Although the assumption that µ is known is usually artificial, W^2 is very easy to analyze and it will be used in some of the derivations below.
3. Show that n W^2 / σ^2 has the chi-square distribution with n degrees of freedom.
4. Use the result of the previous exercise to show that E(W^2) = σ^2 and var(W^2) = 2 σ^4 / n.
Recall that the sample variance is
S^2 = [1/(n-1)] \sum_{i=1}^{n} (X_i - M)^2.
The next series of exercises shows that the sample mean M and the sample variance S^2 are independent. First we note a simple but interesting fact that holds for a random sample from any distribution, not just the normal.
5. Use basic properties of covariance to show that for each i, M and X_i - M are uncorrelated: cov(M, X_i - M) = 0.
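(A one-line sketch of the argument, using cov(X_i, X_j) = 0 for i ≠ j and var(X_i) = σ^2: cov(M, X_i) = σ^2 / n and var(M) = σ^2 / n, so cov(M, X_i - M) = cov(M, X_i) - var(M) = σ^2/n - σ^2/n = 0.)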
Our analysis hinges on the sample mean M and the vector of deviations from the sample mean:
Y = (X_1 - M, X_2 - M, ..., X_{n-1} - M).
6. Show that
X_n - M = - \sum_{i=1}^{n-1} (X_i - M),
and hence show that S^2 can be written as a function of Y.
7. Show that M and the vector Y have a joint multivariate normal distribution.
8. Use the results of the previous exercises to show that M and the vector Y are independent.
9. Finally, show that M and S^2 are independent.
We can now determine the distribution of the sample variance S^2.
10. Show that
n W^2 / σ^2 = (n-1) S^2 / σ^2 + Z^2,
where W^2 and Z are as given above. Hint: In the sum on the left, add and subtract M, and expand.
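(The hint amounts to the standard decomposition, in which the cross terms vanish because \sum_{i=1}^{n} (X_i - M) = 0:
\sum_{i=1}^{n} (X_i - µ)^2 = \sum_{i=1}^{n} (X_i - M)^2 + n (M - µ)^2;
dividing through by σ^2 gives the stated identity, since Z^2 = n (M - µ)^2 / σ^2.)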
11. Show that (n-1) S^2 / σ^2 has the chi-square distribution with n - 1 degrees of freedom. Hint: Use the result of the previous exercise, independence, and moment generating functions.
12. Use the result of the previous exercise to show that E(S^2) = σ^2 and var(S^2) = 2 σ^4 / (n - 1).
Of course, these are special cases of the general results obtained earlier.
The next sequence of exercises will derive the distribution of
T = (M - µ) / (S / \sqrt{n}).
13. Show that T = Z / \sqrt{V / (n-1)}, where Z is as above and where V = (n-1) S^2 / σ^2.
14. Use previous results to show that T has the t distribution with n - 1 degrees of freedom.
The random variable T plays a critical role in constructing interval estimates for µ and performing hypothesis tests for µ.
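A quick Monte Carlo check of this result, assuming SciPy is available (all constants are illustrative): simulate many normal samples, form T for each, and compare the empirical distribution against the t distribution with n - 1 degrees of freedom via a Kolmogorov-Smirnov test.

import numpy as np
from scipy import stats

rng = np.random.default_rng(7)
mu, sigma, n = 2.0, 3.0, 10
samples = rng.normal(mu, sigma, size=(100_000, n))

M = samples.mean(axis=1)
S = samples.std(axis=1, ddof=1)        # sample standard deviation
T = (M - mu) / (S / np.sqrt(n))

# KS test against the t distribution with n - 1 degrees of freedom;
# a large p-value is consistent with exercise 14.
print(stats.kstest(T, stats.t(df=n - 1).cdf))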