To determine whether these differences are statistically significant, we apply a t-test for equality of means to the fold accuracies, comparing each inner-fold estimation against the corresponding outer estimation in a pairwise fashion. The test rejects the null hypothesis of equal means in every case, with all p-values below 0.01. These results stress that estimations based on the inner sets alone are consistently too optimistic and, thus, do not fairly reflect performance on unseen data. A fair accuracy estimation by means of an inner-outer scheme is therefore necessary.
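As an illustration of this kind of comparison, the following minimal sketch tests two sets of fold accuracies for equal means. The text does not state which t-test variant was used, so Welch's unequal-variance form is an assumption here, and the accuracy lists are hypothetical placeholders rather than the values from this study. The p-value is obtained by numerically integrating the Student t density, which keeps the sketch within the Python standard library.

```python
import math
from statistics import mean, variance


def welch_t(a, b):
    """Welch's t statistic and Welch-Satterthwaite degrees of freedom."""
    va = variance(a) / len(a)  # squared standard error of mean(a)
    vb = variance(b) / len(b)  # squared standard error of mean(b)
    t = (mean(a) - mean(b)) / math.sqrt(va + vb)
    df = (va + vb) ** 2 / (va ** 2 / (len(a) - 1) + vb ** 2 / (len(b) - 1))
    return t, df


def t_pdf(x, df):
    """Density of Student's t distribution with df degrees of freedom."""
    log_c = (math.lgamma((df + 1) / 2) - math.lgamma(df / 2)
             - 0.5 * math.log(df * math.pi))
    return math.exp(log_c - (df + 1) / 2 * math.log1p(x * x / df))


def two_sided_p(t, df, steps=2000):
    """Two-sided p-value: 1 minus the t density mass on [-|t|, |t|].

    The mass on [0, |t|] is computed with Simpson's rule (steps must be even).
    """
    upper = abs(t)
    if upper == 0:
        return 1.0
    h = upper / steps
    total = t_pdf(0.0, df) + t_pdf(upper, df)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_pdf(i * h, df)
    return max(0.0, 1.0 - 2.0 * total * h / 3)


# Hypothetical per-fold accuracies (%), NOT the data behind the reported results.
inner_acc = [99.1, 99.8, 100.0, 99.5, 99.9, 99.6, 99.7, 100.0, 99.4, 99.8]
outer_acc = [97.0, 99.5, 95.2, 100.0, 98.8, 96.1, 99.0, 100.0, 97.5, 98.3]

t_stat, df = welch_t(inner_acc, outer_acc)
p_value = two_sided_p(t_stat, df)
print(f"t = {t_stat:.3f}, df = {df:.2f}, p = {p_value:.4f}")
```

The null hypothesis of equal means would be rejected at the 0.01 level whenever `p_value` falls below 0.01. If the inner and outer accuracies are matched fold by fold, a paired t-test on the per-fold differences may be the more appropriate choice.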

| Dataset | Inner accuracy (%) | Outer accuracy (%) |
| --- | --- | --- |
| OVA | 99.65 +/- 0.54 | 98.37 +/- 1.94 |
| TOX | 92.32 +/- 6.54 | 88.35 +/- 10.56 |
| HCC | 98.31 +/- 1.00 | 93.45 +/- 4.17 |
| DGB | 95.64 +/- 1.24 | 90.49 +/- 5.72 |

These results also show that the inner estimations have a low variance, whereas the variance of the outer estimations is up to an order of magnitude greater. This larger variance arises because each inner model overfits to its fold's training set, so its generalization power degrades when it is tested on unseen instances.
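The size of this gap can be checked directly from the reported figures. The short sketch below assumes that each "+/-" value is a standard deviation over folds (the text does not say so explicitly), in which case the variance ratio is the square of the ratio of the two spreads.

```python
# Reported spreads (inner, outer) in accuracy points, copied from the table.
# Assumption: each "+/-" value is a standard deviation over folds.
reported_spread = {
    "OVA": (0.54, 1.94),
    "TOX": (6.54, 10.56),
    "HCC": (1.00, 4.17),
    "DGB": (1.24, 5.72),
}


def variance_ratio(inner_sd, outer_sd):
    """Outer-to-inner variance ratio, given the two standard deviations."""
    return (outer_sd / inner_sd) ** 2


ratios = {name: variance_ratio(*sds) for name, sds in reported_spread.items()}
for name, ratio in ratios.items():
    print(f"{name}: outer/inner variance ratio = {ratio:.1f}")
```

Under this assumption the outer variance exceeds the inner variance for every dataset, with the smallest gap for TOX and the largest for DGB.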