Random forest (classification)
Name | Type | n | Accuracy |
---|---|---|---|
Training set | training | 673 | 1.000 |
Out-of-bag set | internal validation | 673 | 0.854 |
Validation set | external validation | 334 | 0.874 |
Random forest (classification)
Name | Type | n | Accuracy |
---|---|---|---|
Training set | training | 673 | 0.878 |
Out-of-bag set | internal validation | 673 | 0.842 |
Validation set | external validation | 334 | 0.844 |
Random forest (classification)
Name | Type | n | Accuracy |
---|---|---|---|
Training set | training | 673 | 0.767 |
Out-of-bag set | internal validation | 673 | 0.761 |
Validation set | external validation | 334 | 0.737 |
When using this QDB archive, please cite it together with the original article:
Piir, G.; Sild, S.; Maran, U. Data for: Classifying bio-concentration factor with random forest algorithm, influence of the bio-accumulative vs. non-bio-accumulative compound ratio to modelling result, and applicability domain for random forest model. QsarDB repository, QDB.116. 2014. https://doi.org/10.15152/QDB.116
Piir, G.; Sild, S.; Maran, U. Classifying bio-concentration factor with random forest algorithm, influence of the bio-accumulative vs. non-bio-accumulative compound ratio to modelling result, and applicability domain for random forest model. SAR QSAR Environ. Res. 2014, 25, 967-981. https://doi.org/10.1080/1062936X.2014.969310
Title: | Piir, G.; Sild, S.; Maran, U. Classifying bio-concentration factor with random forest algorithm, influence of the bio-accumulative vs. non-bio-accumulative compound ratio to modelling result, and applicability domain for random forest model. SAR QSAR Environ. Res. 2014, 25, 967-981. |
Abstract: | In environmental risk assessment, the bio-concentration factor (BCF) is a widely used parameter in the estimation of the bio-accumulation potential of chemicals. BCF data often have an uneven distribution of classes (bio-accumulative vs. non-bio-accumulative), which could severely bias the classification results towards the prevailing class. The present study focuses on the influence of uneven distribution of the classes in training phase of Random Forest (RF) classification models. Three different training set designs were used and descriptors selected to the models based on the occurrence frequency in RF trees and considering the mechanistic aspects they reflect. Models were compared and their classification performance was analysed, indicating good predictive characteristics (sensitivity = 0.90 and specificity = 0.83) for the balanced set; also imbalanced sets have their strengths in certain application scenarios. The confidence of classifications was assessed with a new schema for the applicability domain that makes use of the RF proximity matrix by analysing the similarity between the predicted compound and the training set of the model. All developed models were made available in the transparent, accessible and reproducible way in QsarDB repository (http://dx.doi.org/10.15152/QDB.116). |
URI: | http://hdl.handle.net/10967/116; http://dx.doi.org/10.15152/QDB.116 |
Date: | 2014-07-30 |
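The abstract notes that an uneven class distribution can bias results towards the prevailing class, and it reports sensitivity and specificity alongside the accuracy values in the tables above. The following is a minimal sketch of how these three metrics relate, not the authors' code. The function names, the label coding (1 = bio-accumulative, 0 = non-bio-accumulative, with 1 treated as the positive class) and the toy data are all assumptions made for illustration.

```python
def confusion_counts(y_true, y_pred, positive=1):
    """Count true/false positives and negatives for a binary problem."""
    tp = fp = tn = fn = 0
    for truth, pred in zip(y_true, y_pred):
        if truth == positive and pred == positive:
            tp += 1
        elif truth == positive:
            fn += 1
        elif pred == positive:
            fp += 1
        else:
            tn += 1
    return tp, fp, tn, fn


def classification_metrics(y_true, y_pred, positive=1):
    """Return accuracy, sensitivity and specificity as a dict."""
    tp, fp, tn, fn = confusion_counts(y_true, y_pred, positive)
    total = tp + fp + tn + fn
    return {
        "accuracy": (tp + tn) / total,
        # Sensitivity: share of positive-class compounds predicted as positive.
        "sensitivity": tp / (tp + fn) if (tp + fn) else float("nan"),
        # Specificity: share of negative-class compounds predicted as negative.
        "specificity": tn / (tn + fp) if (tn + fp) else float("nan"),
    }


# Hypothetical imbalanced data (NOT from QDB.116):
# 10 bio-accumulative (1) and 90 non-bio-accumulative (0) compounds.
y_true = [1] * 10 + [0] * 90

# A trivial "classifier" that always predicts the prevailing class.
y_majority = [0] * 100

metrics = classification_metrics(y_true, y_majority)
print(metrics)
```

On this toy data the majority-class predictor reaches high accuracy while identifying none of the positive-class compounds, which is why accuracy alone can be a weak summary for imbalanced sets.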
Name | Description | Format | Size |
---|---|---|---|
2014SQER967.qdb.zip | Random Forest classification models for BCF | application/zip | 1.302Mb |