Differences in the accuracy estimations between the outer and inner loops

As discussed in Section V of the paper, several works have already pointed out the misleading results produced when the same sample set is used both to search for relevant features and to estimate classification accuracy [23,16]. Table 2 illustrates this point numerically. For each dataset, the best solution found by the UMDA is evaluated both in the internal loop (inner accuracy), which guides the search, and on the external test set (outer accuracy), a set of new instances never seen during the search-train process. Values in Table 2 report the average estimation in each case, together with its associated standard deviation.
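
For concreteness, the sketch below reproduces this inner-outer scheme with scikit-learn. It is an illustration only: the synthetic dataset, the GaussianNB classifier, and the SelectKBest ranking (standing in here for the paper's UMDA feature-subset search) are our assumptions, not the experimental setup of the paper. The inner score is the cross-validated accuracy that guided the selection; the outer score comes from instances the search never saw.

```python
# Minimal sketch of an inner-outer (nested) evaluation scheme.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.naive_bayes import GaussianNB
from sklearn.pipeline import make_pipeline

# Synthetic stand-in for a high-dimensional dataset (assumption).
X, y = make_classification(n_samples=200, n_features=500,
                           n_informative=10, random_state=0)

inner_acc, outer_acc = [], []
outer_cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
for train_idx, test_idx in outer_cv.split(X, y):
    X_tr, y_tr = X[train_idx], y[train_idx]

    # Inner loop: select the feature-subset size that maximizes
    # cross-validated accuracy, using the fold's training part only.
    best_score, best_model = -np.inf, None
    for k in (5, 10, 20, 50):
        model = make_pipeline(SelectKBest(f_classif, k=k), GaussianNB())
        score = cross_val_score(model, X_tr, y_tr, cv=5).mean()
        if score > best_score:
            best_score, best_model = score, model

    inner_acc.append(best_score)  # the optimistic estimate guiding the search
    best_model.fit(X_tr, y_tr)
    outer_acc.append(best_model.score(X[test_idx], y[test_idx]))  # held-out

print(f"inner: {np.mean(inner_acc):.3f} +/- {np.std(inner_acc):.3f}")
print(f"outer: {np.mean(outer_acc):.3f} +/- {np.std(outer_acc):.3f}")
```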

To determine whether these differences are statistically significant, we apply a t-test for equality of means to all fold accuracies (comparing the inner fold estimations against the outer ones). The test rejects the null hypothesis of equal means in every case, with p-values always below 0.01. These results stress the fact that estimations based only on the inner sets are overly optimistic and, thus, do not reflect performance on new data. A fair accuracy estimation by means of an inner-outer scheme is therefore necessary.
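
The sketch below shows one way to run such a test, assuming the per-fold inner and outer accuracies form paired observations (one pair per fold); whether the paper pairs the folds or compares independent samples is not stated, and an unpaired comparison would use scipy.stats.ttest_ind instead. The accuracy values are illustrative placeholders, not the experimental results.

```python
from scipy import stats

# Per-fold accuracies; illustrative placeholder values only.
inner_acc = [0.99, 0.98, 1.00, 0.99, 0.98]
outer_acc = [0.97, 0.95, 0.99, 0.96, 0.94]

# Paired t-test of the null hypothesis that both means are equal.
t_stat, p_value = stats.ttest_rel(inner_acc, outer_acc)
print(f"t = {t_stat:.3f}, p = {p_value:.4f}")  # reject the null when p < 0.01
```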


Table 2: Average accuracy estimations for the internal and the external evaluations. Accuracies are computed per fold, in both the inner and outer loops; reported values are the average across folds together with the associated standard deviation.
Dataset   Inner accuracy (%)   Outer accuracy (%)
OVA       99.65 +/- 0.54       98.37 +/- 1.94
TOX       92.32 +/- 6.54       88.35 +/- 10.56
HCC       98.31 +/- 1.00       93.45 +/- 4.17
DGB       95.64 +/- 1.24       90.49 +/- 5.72

These results also show that the inner estimations have a low variance, whereas the variance of the outer estimations can be an order of magnitude greater. This gap arises because the inner models overfit each fold's training set, so their generalization power degrades when previously unseen instances are tested.
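
As a quick sanity check, the outer-to-inner variance ratios can be recomputed directly from the standard deviations reported in Table 2:

```python
# Squaring the standard deviations from Table 2 gives the variance ratio.
stds = {"OVA": (0.54, 1.94), "TOX": (6.54, 10.56),
        "HCC": (1.00, 4.17), "DGB": (1.24, 5.72)}
for name, (inner_sd, outer_sd) in stds.items():
    print(f"{name}: outer/inner variance ratio = {(outer_sd / inner_sd) ** 2:.1f}")
# OVA 12.9, TOX 2.6, HCC 17.4, DGB 21.3: the ratio exceeds 10x
# for three of the four datasets.
```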