PARC publications

http://hdl.handle.net/10967/274
Wed, 22 Jul 2026 22:40:23 GMT

Piir, G.; Sild, S.; Spilioti, E.; Nikolopoulou, D.; Katsanou, E.; Langezaal, I.; Maran, U. Classification of Thyroid Peroxidase (TPO) Inhibitors Using Transfer Learning with SMILES Embeddings. Chemical Research in Toxicology 2026.
http://hdl.handle.net/10967/272
Thyroid hormones (THs) regulate many processes in mammals and, therefore, affect every organ in the body. Thyroid peroxidase (TPO) is an essential enzyme for the successful biosynthesis of THs. Although TPO inhibition is a well-documented molecular initiating event (MIE) in thyroid hormone system disruption adverse outcome pathways (AOPs), experimental methods and computational models to assess TPO activity are lacking. Efficient computational new approach methodologies (NAMs) are a viable solution for identifying TPO inhibitors from a large pool of agrochemicals. The aim of this study was to investigate the suitability of SMILES embeddings generated using a specialized language model (SLM) based on a pretrained deep neural network (DNN) for applying a transfer learning approach in the development of quantitative structure-activity relationships for classifying TPO inhibitors. Traditional theoretical molecular descriptors were used for comparison. Two different molecular descriptor sets resulted in Random Forest (RF) models that performed similarly on the training and test sets, while the sensitivity for the external validation set was substantially different between the two models (0.788 vs 0.490). Comparison of the predictions with the TPO inhibition data of the chemicals assessed by EFSA and EU-NETVAL laboratories showed good agreement. At the same time, analysis of experimental data from other sources showed some conflicting estimates.
This suggests that further and more precise studies are needed for some compounds. This study advances in silico methodologies by implementing transfer learning for QSAR modeling from text representations (e.g., SMILES) using the pretrained Bidirectional Encoder Representations from Transformers (BERT) architecture. While the traditional QSAR approach relies on molecular descriptors, this evaluation shows that model-generated SMILES embeddings can expand the applicability domain, indicating a more robust representation of structural information compared to traditional molecular descriptors.
Wed, 27 May 2026 17:25:57 GMT

Belfield, S. J.; Cronin, M. T. D.; Enoch, S. J.; Firman, J. W. Guidance for Good Practice in the Application of Machine Learning in Development of Toxicological Quantitative Structure-Activity Relationships (QSARs). PLOS ONE, 2023, 18, e0282924.
http://hdl.handle.net/10967/264
Recent years have seen a substantial growth in the adoption of machine learning approaches for the purposes of quantitative structure-activity relationship (QSAR) development. Such a trend has coincided with a desire to see a shift in the focus of methodology employed within chemical safety assessment: away from traditional reliance upon animal-intensive in vivo protocols, and towards increased application of in silico (or computational) predictive toxicology. With QSAR central amongst techniques applied in this area, the emergence of algorithms trained through machine learning with the objective of toxicity estimation has, quite naturally, followed. On account of the pattern-recognition capabilities of the underlying methods, the statistical power of the ensuing models is potentially considerable, appropriate for the handling even of vast, heterogeneous datasets.
However, such potency comes at a price, manifesting as the general practical deficits observed with respect to the reproducibility, interpretability and generalisability of the resulting tools. Unsurprisingly, these elements have served to hinder broader uptake (most notably within a regulatory setting). Areas of uncertainty liable to accompany (and hence detract from applicability of) toxicological QSAR have previously been highlighted, accompanied by the forwarding of suggestions for “best practice” aimed at mitigation of their influence. However, the scope of such exercises has remained limited to “classical” QSAR, that is, QSAR conducted through use of linear regression and related techniques, with the adoption of comparatively few features or descriptors. Accordingly, the intention of this study has been to extend the remit of best practice guidance, so as to address concerns specific to employment of machine learning within the field. In doing so, the impact of strategies aimed at enhancing the transparency (feature importance, feature reduction), generalisability (cross-validation) and predictive power (hyperparameter optimisation) of algorithms, trained upon real toxicity data through six common learning approaches, is evaluated.
Mon, 21 Oct 2024 11:48:57 GMT

Kotli, M.; Piir, G.; Maran, U. Predictive Modeling of Pesticides Reproductive Toxicity in Earthworms Using Interpretable Machine-Learning Techniques on Imbalanced Data. ACS Omega 2025, 10, 4732–4744.
http://hdl.handle.net/10967/263
The earthworm is a key indicator species in soil ecosystems. This makes the reproductive toxicity of chemical compounds to earthworms a property worth determining and makes computational models necessary for descriptive and predictive purposes. Thus, the aim was to develop an advanced Quantitative Structure–Activity Relationship modeling approach for this complex property with imbalanced data.
The approach integrated gradient-boosted decision trees as classifiers with a genetic algorithm for feature selection and Bayesian optimization for hyperparameter tuning. An additional goal was to analyze and interpret, using SHAP values, the structural features encoded by the molecular descriptors that contribute to pesticide toxicity and nontoxicity, the most notable of which are solvation entropy and the number of hydrolyzable bonds. The final model was constructed as a stacked ensemble of models and combined the strengths of the individual models. Evaluation of this model with an external test set of 147 compounds demonstrated a well-defined applicability domain and sufficient predictive capabilities with a Balanced Accuracy of 77%. The model representation follows FAIR principles and is available on QsarDB.org.
Wed, 09 Oct 2024 12:03:43 GMT

Piir, G.; Sild, S.; Maran, U. Interpretable machine learning for the identification of estrogen receptor agonists, antagonists, and binders. Chemosphere 2024, 347, 140671.
http://hdl.handle.net/10967/259
An abnormal hormonal activity or exposure to endocrine-disrupting chemicals (EDCs) can cause endocrine system malfunction. Among the many interactions EDCs can affect is the disruption of estrogen signalling, which can lead to adverse health effects such as cancer, osteoporosis, neurodegenerative diseases, cardiovascular disease, insulin resistance, and obesity. Knowing which chemical can act as an EDC is a significant advantage and a practical necessity. New Approach Methodologies (NAM) computational models offer a quick and cost-effective solution for preliminary hazard assessment of chemicals without animal testing. Therefore, a machine learning approach was used to investigate the relationships between estrogen receptor (ER) activity and chemical structure to identify chemicals that can interact with ER.
For this purpose, the consolidated in vitro assay data from ToxCast/Tox21 projects was used for developing Random Forest classification models for ER binding, agonists, and antagonists. The overall classification prediction accuracy reaches up to 82%, depending on whether the model predicted agonists, antagonists, or compounds that bind to the active site. Given the imbalance in endocrine disruption data, the derived models are good candidates for deprioritising chemicals and reducing animal testing. The interpretation of theoretical molecular descriptors of the models was consistent with the molecular interactions known in the ligand binding pocket. The estimated class probabilities enabled the analysis of the applicability domain of the developed models and the assessment of the predictions’ reliability, followed by the guidelines for interpreting prediction results. The models are openly accessible and usable at QsarDB.org according to the FAIR (Findable, Accessible, Interoperable, Reusable) principles.
Thu, 14 Sep 2023 13:44:47 GMT
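
The use of class probabilities as a reliability signal, mentioned in the abstract above, and the balanced accuracy metric reported for imbalanced data can be illustrated with a short sketch. This is not the published models' procedure: the function names, the 0.5 decision threshold, and the 0.7 reliability cutoff are hypothetical choices made only for illustration.

```python
from typing import List, Sequence, Tuple


def reliability_flags(probs: Sequence[float], cutoff: float = 0.7) -> List[Tuple[int, bool]]:
    """Map each P(active) to (predicted_class, is_reliable).

    The 0.7 cutoff is a hypothetical value, not taken from the papers.
    """
    flags = []
    for p in probs:
        if not 0.0 <= p <= 1.0:
            raise ValueError(f"probability out of range: {p}")
        label = 1 if p >= 0.5 else 0      # assumed decision threshold
        confidence = max(p, 1.0 - p)      # probability of the predicted class
        flags.append((label, confidence >= cutoff))
    return flags


def balanced_accuracy(y_true: Sequence[int], y_pred: Sequence[int]) -> float:
    """Mean of sensitivity and specificity for binary labels (1 = active)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    if tp + fn == 0 or tn + fp == 0:
        raise ValueError("both classes must be present in y_true")
    sensitivity = tp / (tp + fn)
    specificity = tn / (tn + fp)
    return (sensitivity + specificity) / 2.0
```

Predictions whose class probability falls near 0.5 are flagged as less reliable, which is one simple way to read a probability-based applicability domain. Balanced accuracy weights both classes equally, so a model cannot score well on imbalanced data by predicting only the majority class.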