Forest-like EDA factorization model

We adapted a forest-like EDA as the probability distribution model [12,17]. This model builds a tree, or a forest of several trees, using the mutual information metric to establish the dependencies between pairs of variables. The main drawback is that the computational complexity increases significantly with respect to UMDA and, combined with the size of the MS problem, sometimes makes its application infeasible. New experiments with the SELDI datasets were not carried out because the running time of even a single run is very large. In contrast, for the MALDI datasets the running time is lower by three to four orders of magnitude.
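To make the structure-learning step concrete, the following is a minimal sketch of how a forest can be built from pairwise mutual information. It is not the implementation used in [12,17]. It assumes that individuals are binary vectors (one bit per peakbin), that the forest is obtained with a Kruskal-style maximum-weight spanning procedure, and that an edge is kept only when its mutual information exceeds a hypothetical `threshold` parameter. The function names are illustrative.

```python
import numpy as np


def mutual_information(x, y):
    """Empirical mutual information (in nats) between two binary vectors."""
    mi = 0.0
    for a in (0, 1):
        for b in (0, 1):
            p_ab = np.mean((x == a) & (y == b))
            p_a = np.mean(x == a)
            p_b = np.mean(y == b)
            if p_ab > 0:  # terms with zero joint probability contribute nothing
                mi += p_ab * np.log(p_ab / (p_a * p_b))
    return mi


def learn_forest(pop, threshold=0.05):
    """Return the undirected edges (i, j) of a forest over the variables.

    pop: 2-D array, one selected individual per row, one binary variable
    per column. `threshold` is a hypothetical cut-off: pairs whose mutual
    information does not exceed it are left unconnected, which is what
    yields a forest instead of a single tree.
    """
    n = pop.shape[1]
    edges = []
    for i in range(n):  # every pair of variables is scored
        for j in range(i + 1, n):
            mi = mutual_information(pop[:, i], pop[:, j])
            if mi > threshold:
                edges.append((mi, i, j))
    edges.sort(reverse=True)  # strongest dependencies first

    # Union-find structure to reject edges that would close a cycle.
    root = list(range(n))

    def find(v):
        while root[v] != v:
            root[v] = root[root[v]]
            v = root[v]
        return v

    forest = []
    for _, i, j in edges:
        ri, rj = find(i), find(j)
        if ri != rj:
            root[ri] = rj
            forest.append((i, j))
    return forest


# Toy population: variable 1 copies variable 0, variable 2 is independent.
pop = np.array([[0, 0, 0], [0, 0, 1], [1, 1, 0], [1, 1, 1]] * 5)
print(learn_forest(pop))
```

In this sketch the pairwise scoring loop is quadratic in the number of variables, whereas UMDA only needs one marginal per variable. This is consistent with the increase in running time noted above for large peakbin sets.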

The following two tables show the results collected over 50 runs of this forest approach (same running scheme as in the paper). This more complex factorization does not improve the original UMDA results. The same behavior is observed when the classification paradigm is changed to instance-based learning (IBL), i.e. k-NN. Results with k = 3, chosen to avoid ties, are collected in the last two tables.

Table: Multistart results produced by the population consensus proposal using a naïve Bayes classifier embedded in a forest-like EDA factorization model. Results are computed using 50 multistart runs.

	HCC	DGB
Total number of solutions throughout 50 runs	7,132	5,562
Mean number of solutions on each Pareto front	31.64	25
Mean number of peakbins per Pareto solution	30.30	18.15
Maximum accuracy	94.69 +/- 4.01	92.25 +/- 4.82
Peakbins	34.80 +/- 25.96	20.40 +/- 31.32

Table: Average accuracy estimations for the internal and the external evaluations for the naïve Bayes classifier and the forest-like EDA factorization model. Estimations are computed for each fold, in both the inner and outer loops and include their associated standard deviation.

	Inner accuracy	Outer accuracy
HCC	94.10 +/- 1.11	93.60 +/- 3.72
DGB	87.61 +/- 1.70	90.42 +/- 5.31

Table: Multistart results produced by the population consensus proposal using a k-NN classifier (k value is set to 3) embedded in a forest-like EDA factorization model. Results are computed using 50 multistart runs.

	HCC	DGB
Total number of solutions throughout 50 runs	6,930	7,058
Mean number of solutions on each Pareto front	31.12	31.64
Mean number of peakbins per Pareto solution	38.15	41.70
Maximum accuracy	96.67 +/- 3.65	89.87 +/- 3.07
Peakbins	126.80 +/- 81.63	36.20 +/- 28.94

Table: Average accuracy estimations for the internal and the external evaluations for the k-NN classifier (k value is set to 3) and the forest-like EDA factorization model. Estimations are computed for each fold, in both the inner and outer loops and include their associated standard deviation.

	Inner accuracy	Outer accuracy
HCC	93.41 +/- 1.37	94.69 +/- 2.98
DGB	85.93 +/- 1.80	87.22 +/- 5.24