Forest-like EDA factorization model

We have adapted a forest-like EDA factorization as the probability distribution model [12,17]. This model builds a tree, or a forest of several trees, using mutual information to set the dependences between pairs of variables. The main drawback is that the computational complexity increases significantly with respect to UMDA, which, combined with the size of the MS problem, sometimes makes its application infeasible. We did not run new experiments on the SELDI datasets because the running time of even a single run is very large. In contrast, for the MALDI datasets the running time is smaller by three to four orders of magnitude.
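As an illustration of how a forest can be derived from pairwise mutual information, the following minimal sketch builds a maximum-weight spanning forest in the style of Chow and Liu. It is not necessarily the procedure used in [12,17]: the function names, the `threshold` parameter, and the toy population are assumptions made for this example only. Every pair of variables has to be scored, so the cost grows quadratically with the number of variables, whereas UMDA only needs univariate marginals.

```python
import math
from collections import Counter
from itertools import combinations


def mutual_information(x, y):
    """Empirical mutual information (in nats) between two discrete sequences."""
    n = len(x)
    px, py, pxy = Counter(x), Counter(y), Counter(zip(x, y))
    return sum(
        (c / n) * math.log(c * n / (px[a] * py[b]))
        for (a, b), c in pxy.items()
    )


def learn_forest(population, threshold=0.05):
    """Return the edges (i, j, mi) of a maximum-weight spanning forest.

    population: list of equal-length discrete individuals (the selected ones).
    threshold: hypothetical cut-off; pairs whose mutual information does not
               exceed it stay unconnected, which can leave several trees.
    """
    n_vars = len(population[0])
    columns = list(zip(*population))

    # Score every pair of variables: O(n_vars^2) mutual information estimates.
    candidates = []
    for i, j in combinations(range(n_vars), 2):
        mi = mutual_information(columns[i], columns[j])
        if mi > threshold:
            candidates.append((mi, i, j))
    candidates.sort(reverse=True)

    # Kruskal-style selection with a union-find structure to avoid cycles.
    parent = list(range(n_vars))

    def find(v):
        while parent[v] != v:
            parent[v] = parent[parent[v]]
            v = parent[v]
        return v

    edges = []
    for mi, i, j in candidates:
        root_i, root_j = find(i), find(j)
        if root_i != root_j:
            parent[root_i] = root_j
            edges.append((i, j, mi))
    return edges


# Toy population: variables 0 and 1 are identical, variable 2 is independent.
population = [(0, 0, 0), (0, 0, 1), (1, 1, 0), (1, 1, 1)]
edges = learn_forest(population)
print(edges)
```

In this toy case only variables 0 and 1 are linked, and variable 2 remains an isolated node, so the result is a forest rather than a single tree.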

The following two tables show the results collected over 50 runs of this forest approach (same running scheme as in the paper). This more complex factorization does not improve on the original UMDA results. The same behavior is observed when the classification paradigm is changed to instance-based learning (IBL), that is, k-NN. Results with k = 3, chosen to avoid ties, are collected in the last two tables.


Table: Multistart results produced by the population consensus proposal using a naïve Bayes classifier embedded in a forest-like EDA factorization model. Results are computed using 50 multistart runs.
                                                HCC               DGB
Total number of solutions throughout 50 runs    7,132             5,562
Mean number of solutions on each Pareto front   31.64             25
Mean number of peakbins per Pareto solution     30.30             18.15
Maximum accuracy                                94.69 +/- 4.01    92.25 +/- 4.82
Peakbins                                        34.80 +/- 25.96   20.40 +/- 31.32



Table: Average accuracy estimations for the internal and the external evaluations for the naïve Bayes classifier and the forest-like EDA factorization model. Estimations are computed for each fold, in both the inner and outer loops and include their associated standard deviation.
       Inner accuracy    Outer accuracy
HCC    94.10 +/- 1.11    93.60 +/- 3.72
DGB    87.61 +/- 1.70    90.42 +/- 5.31



Table: Multistart results produced by the population consensus proposal using a k-NN classifier (k value is set to 3) embedded in a forest-like EDA factorization model. Results are computed using 50 multistart runs.
                                                HCC                DGB
Total number of solutions throughout 50 runs    6,930              7,058
Mean number of solutions on each Pareto front   31.12              31.64
Mean number of peakbins per Pareto solution     38.15              41.70
Maximum accuracy                                96.67 +/- 3.65     89.87 +/- 3.07
Peakbins                                        126.80 +/- 81.63   36.20 +/- 28.94



Table: Average accuracy estimations for the internal and the external evaluations for the k-NN classifier (k value is set to 3) and the forest-like EDA factorization model. Estimations are computed for each fold, in both the inner and outer loops and include their associated standard deviation.
       Inner accuracy    Outer accuracy
HCC    93.41 +/- 1.37    94.69 +/- 2.98
DGB    85.93 +/- 1.80    87.22 +/- 5.24
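
The choice of k = 3 to avoid ties can be illustrated with a small majority-vote sketch. It assumes a two-class problem, where an odd k always yields a strict majority; the class labels and the `knn_vote` helper are hypothetical and are not taken from the original experiments.

```python
from collections import Counter


def knn_vote(neighbor_labels):
    """Return (winning label, True if the top vote count is shared)."""
    counts = Counter(neighbor_labels).most_common()
    tied = len(counts) > 1 and counts[0][1] == counts[1][1]
    return counts[0][0], tied


# With k = 3 and two classes, one class always holds at least two votes.
label_k3, tied_k3 = knn_vote(["case", "control", "case"])

# With an even k such as 2, the vote can split evenly.
label_k2, tied_k2 = knn_vote(["case", "control"])

print(label_k3, tied_k3, tied_k2)
```

With more than two classes an odd k no longer guarantees a strict majority, so this argument only holds under the two-class assumption.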