## Cross-validation in RapidMiner

Cross-validation is a standard statistical method to estimate the generalization error of a predictive model. In $k$-fold cross-validation the available data set is divided into $k$ subsets of (nearly) equal size. Then the following procedure is repeated for each subset: a model is built using the other $(k - 1)$ subsets as the training set and its performance is evaluated on the current subset. This means that each subset is used for testing exactly once. The result of the cross-validation is the average of the performances obtained from the $k$ rounds.
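The procedure can be sketched in a few lines of plain Python. This is only an illustration of the general idea, not of RapidMiner's implementation: the helper names (`k_fold_indices`, `cross_validate`, `train_majority`, `accuracy`) are made up for this example, the folds are taken as contiguous blocks without shuffling or stratification, and the "model" is a trivial majority-class predictor.

```python
from statistics import mean


def k_fold_indices(n, k):
    """Split range(n) into k contiguous folds whose sizes differ by at most one."""
    base, extra = divmod(n, k)
    folds, start = [], 0
    for i in range(k):
        size = base + (1 if i < extra else 0)
        folds.append(list(range(start, start + size)))
        start += size
    return folds


def cross_validate(examples, labels, k, train, evaluate):
    """Run k rounds; return the per-round scores and their mean."""
    scores = []
    for test_idx in k_fold_indices(len(examples), k):
        held_out = set(test_idx)
        train_idx = [i for i in range(len(examples)) if i not in held_out]
        # Build the model on the other k - 1 folds ...
        model = train([examples[i] for i in train_idx],
                      [labels[i] for i in train_idx])
        # ... and evaluate it on the fold that was held out.
        scores.append(evaluate(model,
                               [examples[i] for i in test_idx],
                               [labels[i] for i in test_idx]))
    return scores, mean(scores)


def train_majority(xs, ys):
    """Toy learner (hypothetical): always predict the most frequent label."""
    # sorted() makes tie-breaking deterministic.
    return max(sorted(set(ys)), key=ys.count)


def accuracy(model, xs, ys):
    """Fraction of test labels equal to the constant prediction."""
    return sum(1 for y in ys if y == model) / len(ys)


toy_examples = list(range(9))
toy_labels = ["T"] * 3 + ["F"] * 6
scores, average = cross_validate(toy_examples, toy_labels, 3,
                                 train_majority, accuracy)
print(scores, average)
```

Each index lands in exactly one test fold, so every example is used for testing exactly once, as described above.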

This post explains how to interpret cross-validation results in RapidMiner. For demonstration purposes, we consider the following simple RapidMiner process that is available here:

The Read URL operator reads the yellow-small+adult-stretch.data file, a subset of the Balloons Data Set available from the UCI Machine Learning Repository. Since this data set contains only 16 examples, it is very easy to perform all calculations in your head.

The Set Role operator marks the last attribute as the one that provides the class labels. The number of validations is set to 3 on the X-Validation operator, which results in a 5-5-6 partitioning of the 16 examples in our case.
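The fold sizes follow from simple integer division. The snippet below is a small sketch of that arithmetic; `fold_sizes` is a hypothetical helper, and it says nothing about which of the three rounds RapidMiner gives the larger fold to.

```python
def fold_sizes(n_examples, k):
    """Sizes of k folds that differ by at most one, in ascending order."""
    base, extra = divmod(n_examples, k)
    # 'extra' folds get one additional example.
    return sorted([base + 1] * extra + [base] * (k - extra))


sizes = fold_sizes(16, 3)
print(sizes)  # [5, 5, 6]
```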

In the training subprocess of the cross-validation process a decision tree classifier is built on the current training set. In the testing subprocess the accuracy of the decision tree is computed on the test set.

The result of the process is the following PerformanceVector:

74.44 is obviously the arithmetic mean of the accuracies obtained from the three rounds and 10.30 is their standard deviation. However, it is not clear how to interpret the confusion matrix below them and the value labelled with the word mikro. You may ask how a single confusion matrix can be returned if several models are built and evaluated in the cross-validation process.

The Write as Text operator in the inner testing subprocess writes the performance vectors to a text file, which helps us to understand the results above. The file contains the confusion matrix obtained from each round together with the corresponding accuracy value, as shown below:

```
13.04.2012 22:07:35 Results of ResultWriter 'Write as Text' [1]:
13.04.2012 22:07:35 PerformanceVector:
accuracy: 60.00%
ConfusionMatrix:
True:	T	F
T:	0	0
F:	2	3

13.04.2012 22:07:35 Results of ResultWriter 'Write as Text' [2]:
13.04.2012 22:07:35 PerformanceVector:
accuracy: 83.33%
ConfusionMatrix:
True:	T	F
T:	2	0
F:	1	3

13.04.2012 22:07:35 Results of ResultWriter 'Write as Text' [3]:
13.04.2012 22:07:35 PerformanceVector:
accuracy: 80.00%
ConfusionMatrix:
True:	T	F
T:	2	1
F:	0	2
```
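The two headline figures can be reproduced from the three per-round accuracies in this log. The check below is a minimal sketch; it assumes the reported deviation is the population standard deviation (dividing by 3 rather than 2), which is the variant that matches 10.30 here.

```python
from statistics import mean, pstdev

# Per-round accuracies (in percent) copied from the log above.
accuracies = [60.00, 83.33, 80.00]

average = mean(accuracies)
spread = pstdev(accuracies)  # population standard deviation

print(f"{average:.2f} +/- {spread:.2f}")
```

With the sample standard deviation (`statistics.stdev`) the second figure would come out noticeably larger, so it does not match the reported value.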

Notice that the confusion matrix on the PerformanceVector (Performance) tab is simply the sum of the three confusion matrices. The value labelled with the word mikro (75) is the accuracy computed from this aggregated confusion matrix: 12 of the 16 examples lie on its diagonal, and 12/16 = 75%. A performance calculated this way is called the mikro (micro) average, while the mean of the per-round accuracies is called the makro (macro) average. Note that the confusion matrix behind the mikro average is constructed by evaluating different models on different test sets.
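The difference between the two averages is easy to verify by hand or with a short script. The sketch below uses the three confusion matrices from the log, with rows as predicted classes and columns as true classes (both in the order T, F); the variable and function names are chosen for this example only.

```python
from statistics import mean

# One 2x2 confusion matrix per round: rows = predicted, columns = true.
rounds = [
    [[0, 0], [2, 3]],
    [[2, 0], [1, 3]],
    [[2, 1], [0, 2]],
]


def accuracy(matrix):
    """Accuracy in percent: diagonal entries divided by all entries."""
    correct = matrix[0][0] + matrix[1][1]
    total = sum(sum(row) for row in matrix)
    return 100.0 * correct / total


# Element-wise sum of the per-round matrices.
aggregated = [[sum(m[i][j] for m in rounds) for j in range(2)]
              for i in range(2)]

mikro = accuracy(aggregated)                  # accuracy of the summed matrix
makro = mean(accuracy(m) for m in rounds)     # mean of per-round accuracies

print(aggregated)
print(f"mikro: {mikro:.2f}%, makro: {makro:.2f}%")
```

The two values differ because the folds are not all the same size: the mikro average weights every example equally, while the makro average weights every round equally.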