FAIR principles and QSAR models

The recent article "Moving Towards Making QSARs for Toxicity-Related Endpoints Findable, Accessible, Interoperable and Reusable (FAIR)" addresses a significant challenge in computational toxicology: how to ensure that QSAR models based on machine learning (ML) and artificial intelligence (AI) are not only developed but also made accessible and usable.

Why FAIR principles for QSARs?

Quantitative structure-activity relationship (QSAR) models are computational tools for predicting chemical properties and toxicological endpoints without the need for animal testing. However, the majority of published QSARs are practically unusable because their data are not available in machine-readable formats and they lack unique identifiers, standardised metadata, and clear licensing for reuse.

To overcome these limitations, the FAIR principles provide a set of guidelines to improve the Findability, Accessibility, Interoperability, and Reusability of digital assets, including research data. The study evaluated six machine learning QSAR models against the FAIR criteria and found significant gaps, particularly in findability and interoperability. These gaps were addressed by converting the data to the QsarDB data format and making it available via the QsarDB repository.
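The kinds of gaps described above can be pictured as a simple checklist. The following is a minimal Python sketch, not the assessment procedure used in the study: the record fields, the set of machine-readable formats, and the required metadata keys are all hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass, field

# Hypothetical metadata record. The field names are illustrative only and
# do not follow the QsarDB schema or the criteria used in the article.
@dataclass
class ModelRecord:
    title: str
    identifier: str = ""      # e.g. a DOI or Handle
    data_format: str = ""     # e.g. "xml", "csv", "pdf"
    metadata: dict = field(default_factory=dict)
    license: str = ""

# Assumed for this sketch, not an authoritative list.
MACHINE_READABLE = {"csv", "json", "xml", "sdf"}
REQUIRED_METADATA = ("endpoint", "algorithm", "descriptors")


def fair_gaps(record):
    """Return a list of human-readable gaps found in the record."""
    gaps = []
    if not record.identifier:
        gaps.append("no unique identifier")
    if record.data_format.lower() not in MACHINE_READABLE:
        gaps.append("data not in a machine-readable format")
    missing = [key for key in REQUIRED_METADATA if not record.metadata.get(key)]
    if missing:
        gaps.append("missing metadata: " + ", ".join(missing))
    if not record.license:
        gaps.append("no clear licensing for reuse")
    return gaps


# A model that exists only as a table in a PDF supplement.
published_only = ModelRecord(title="Model from a PDF supplement", data_format="pdf")

# The same model after deposition (placeholder identifier and values).
deposited = ModelRecord(
    title="Deposited model",
    identifier="doi:10.0000/example.1",
    data_format="xml",
    metadata={"endpoint": "growth inhibition", "algorithm": "RF", "descriptors": "2D"},
    license="CC-BY-4.0",
)

for record in (published_only, deposited):
    print(record.title, "->", fair_gaps(record) or "no gaps found")
```

A real FAIR assessment covers many more criteria than this; the sketch only mirrors the four shortcomings named in the text.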

Case study: FAIRifying ML/AI models

In this study, six QSAR models for predicting Tetrahymena pyriformis growth inhibition were uploaded to the QsarDB repository. They made a good test case because each model uses a different type of ML or AI method: k-nearest neighbours (k-NN), random forest (RF), support vector machine (SVM), XGBoost (XGB), artificial neural network (ANN), and deep ANN.

The process involved four steps:

  • The original models were reproduced and converted to ONNX (Open Neural Network Exchange) format.
  • The models and related data were stored in the QsarDB data representation.
  • The archive and its metadata were deposited in the QsarDB repository.
  • The QsarDB repository assigned persistent identifiers (DOI, handle.net), making the models findable and reusable.

As a result, you can now explore these models in the repository.
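Persistent identifiers help findability because they resolve through public resolver services rather than through a web address that may change. As a small illustration, the Python sketch below builds resolver URLs for DOIs (via doi.org) and Handles (via hdl.handle.net). The `doi:` and `hdl:` prefixes and the identifiers shown are placeholders assumed for this example, not the identifiers of the deposited models.

```python
from urllib.parse import quote

# Public resolver services for the two identifier types.
RESOLVERS = {
    "doi": "https://doi.org/",
    "hdl": "https://hdl.handle.net/",
}


def resolver_url(identifier):
    """Turn 'doi:<value>' or 'hdl:<value>' into a resolvable URL.

    The 'scheme:value' notation is an assumption of this sketch.
    """
    scheme, sep, value = identifier.partition(":")
    scheme = scheme.lower()
    if not sep or not value or scheme not in RESOLVERS:
        raise ValueError(f"unsupported identifier: {identifier!r}")
    # Keep '/' as-is and percent-encode any other reserved characters.
    return RESOLVERS[scheme] + quote(value, safe="/")


# Placeholder identifiers, for illustration only.
for pid in ("doi:10.0000/example.1", "hdl:00000/1"):
    print(pid, "->", resolver_url(pid))
```

Because the identifier stays the same even if the repository reorganises its pages, a citation that uses the resolver URL keeps working.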

Lessons learned

This use case shows how QsarDB can turn existing non-FAIR models into FAIR resources, thereby bridging the gap between academic research and practical application. Although we recommend PMML (Predictive Model Markup Language) as the default model representation, ONNX can be a good alternative for deep learning and very complex ML models.
