Get started


Uploading QDB archives

The QsarDB repository welcomes contributions of (Q)SAR/QSPR models that are related to a scientific paper that is either published or accepted for publication.

  • The uploaded models must be represented in the QsarDB archive format (please familiarize yourself with the QsarDB archive format and with the tools that help you prepare QDB archives).
  • Research groups will receive their own community and manage their own collections of models. Each community and collection is referenced by a unique and persistent HDL identifier (Handle System), so that it can be cited unambiguously.

Please contact us by e-mail at qsardb@chem.ut.ee and we'll help you get started.

Examples of realized PMMLs

The QsarDB file format uses the Predictive Model Markup Language (PMML) for the mathematical definition of an in silico predictive model. Every model type that has PMML support is also supported by the QsarDB repository. Good example models are hard to find. The following list gives currently realized PMMLs, their corresponding archives, and the information required for a PMML to be usable within QsarDB.

Regression model (regression)

Example: Moosus, M.; Maran, U. Quantitative structure-activity relationship analysis of acute toxicity of diverse chemicals to Daphnia magna with whole molecule descriptors. SAR and QSAR in Environmental Research 2011, 22, 7-8, 757–774.

Archive: http://dx.doi.org/10.15152/QDB.111

Regression model (classification)

Example: Benigni, R.; Bossa, C.; Netzeva, T.; Rodomonte, A.; Tsakovska, I. Mechanistic QSAR of aromatic amines: New models for discriminating between homocyclic mutagens and nonmutagens, and validation of models for carcinogens. Environmental and Molecular Mutagenesis 2007, 48, 754–771.

Archive: http://dx.doi.org/10.15152/QDB.141

Neural network (regression)

Example: Modarresi, H.; Modarress, H.; Dearden, J. C. Henry’s law constant of hydrocarbons in air–water system: The cavity ovality effect on the non-electrostatic contribution term of solvation free energy. SAR and QSAR in Environmental Research 2005, 16, 461–482.

Archive: http://dx.doi.org/10.15152/QDB.150

Random forest (classification)

Example: Piir, G.; Sild, S.; Maran, U. Classifying bio-concentration factor with random forest algorithm, influence of the bio-accumulative vs. non-bio-accumulative compound ratio to modelling result, and applicability domain for random forest model. SAR and QSAR in Environmental Research 2014, 25, 967-981.

Archive: http://dx.doi.org/10.15152/QDB.116

Support vector machine (regression)

Example: Saldana, D. A.; Starck, L.; Mougin, P.; Rousseau, B.; Pidol, L.; Jeuland, N.; Creton, B. Flash Point and Cetane Number Predictions for Fuel Compounds Using Quantitative Structure Property Relationship (QSPR) Methods. Energy & Fuels 2011, 25, 9, 3900–3908.

Archive: http://dx.doi.org/10.15152/QDB.123

Ensemble model (regression)

Example: Saldana, D. A.; Starck, L.; Mougin, P.; Rousseau, B.; Pidol, L.; Jeuland, N.; Creton, B. Flash Point and Cetane Number Predictions for Fuel Compounds Using Quantitative Structure Property Relationship (QSPR) Methods. Energy & Fuels 2011, 25, 9, 3900–3908.

Archive: http://dx.doi.org/10.15152/QDB.123

Decision tree (classification)

Example: Ringeissen, S.; Marrot, L.; Note, R.; Labarussiat, A.; Imbert, S.; Todorov, M.; Mekenyan, O.; Meunier, J.-R. Development of a mechanistic SAR model for the detection of phototoxic chemicals and use in an integrated testing strategy. Toxicology in Vitro 2011, 25, 324–334.

Archive: http://dx.doi.org/10.15152/QDB.139

Create QDB archive

A QDB archive can be created by following the standard published in J. Cheminf. 2014, 6:25 (DOI: 10.1186/1758-2946-6-25). To avoid programming, command-line tools and the QsarDB editor have been created that convert tabular and structure files into a QDB archive. The graphical guide below describes how to make a QDB archive using the QsarDB editor (also available as a PDF file).

Best practices

The following guidelines describe how to fill in the fields of a QDB archive.

How to name a QDB archive file?

Archive naming uses the following convention: year, journal abbreviation (first letters), first page, followed by .qdb.zip

Example:

  • Year: 2014
  • Journal: SAR and QSAR in Environmental Research
  • Pages: 967-981
  • Name of the QDB archive file: 2014SQER967.qdb.zip
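The naming convention above can be sketched in a few lines of Python. This is only an illustration: the function name `qdb_archive_name` is hypothetical, and the rule for abbreviating the journal (first letter of each capitalized word, so that "and" and "in" are skipped) is an assumption inferred from the single example given here.

```python
def qdb_archive_name(year, journal, pages):
    """Build a QDB archive file name: year + journal abbreviation + first page."""
    # Assumption: the abbreviation takes the first letter of every word that
    # starts with an uppercase letter, which skips words such as "and" and "in".
    abbreviation = "".join(word[0] for word in journal.split() if word[0].isupper())
    # Accept page ranges written with a hyphen or an en dash and keep the first page.
    first_page = pages.replace("\u2013", "-").split("-")[0].strip()
    return f"{year}{abbreviation}{first_page}.qdb.zip"


print(qdb_archive_name(2014, "SAR and QSAR in Environmental Research", "967-981"))
```

For journals whose customary abbreviation does not follow this rule, the abbreviation would need to be supplied by hand.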

Archive’s information

The archive’s name should match the article’s title

Example: Classifying bio-concentration factor with random forest algorithm, influence of the bio-accumulative vs. non-bio-accumulative compound ratio to modelling result, and applicability domain for random forest model

The archive’s description should match the article’s abstract

Example: In environmental risk assessment, the bio-concentration factor (BCF) is a widely used parameter in the estimation of the bio-accumulation potential of chemicals. BCF data often have an uneven distribution of classes (bio-accumulative vs. non-bio-accumulative), which could severely bias the classification results towards the prevailing class. The present study focuses on the influence of uneven distribution of the classes in training phase of Random Forest (RF) classification models. Three different training set designs were used and descriptors selected to the models based on the occurrence frequency in RF trees and considering the mechanistic aspects they reflect. Models were compared and their classification performance was analysed, indicating good predictive characteristics (sensitivity = 0.90 and specificity = 0.83) for the balanced set; also imbalanced sets have their strengths in certain application scenarios. The confidence of classifications was assessed with a new schema for the applicability domain that makes use of the RF proximity matrix by analysing the similarity between the predicted compound and the training set of the model. All developed models were made available in the transparent, accessible and reproducible way in QsarDB repository http://dx.doi.org/10.15152/QDB.116.

Compounds information

  • ID: The compound’s ID must be the same as in the corresponding article.
  • Name: The recommended Name of a compound is its preferred IUPAC name.
  • InChI: The InChI value must be a standard InChI (starts with the prefix “InChI=1S”); a non-standard InChI must be stored as a structure Cargo.
  • CAS: The CAS registry number (CAS RN) is the identifier for chemical substances that includes all categories of chemical compounds in the CAS registry database.
  • Label
  • Description
  • MDL molfile: MOL file as a structure Cargo (3D structure information).
  • SMILES: Simplified Molecular-Input Line-Entry System (SMILES) string as a structure Cargo (2D structure information).
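The InChI rule in the list above can be checked mechanically before an archive is assembled. The following is a minimal sketch, assuming the standard prefix is written without spaces and followed by a slash (`InChI=1S/`), as it appears in actual InChI strings; the function names and the returned labels are hypothetical and are not part of any QsarDB tool.

```python
STANDARD_INCHI_PREFIX = "InChI=1S/"


def is_standard_inchi(inchi):
    """Return True if the string carries the standard InChI prefix."""
    return inchi.strip().startswith(STANDARD_INCHI_PREFIX)


def inchi_destination(inchi):
    """Suggest where an InChI belongs according to the guideline above."""
    # Hypothetical labels: standard InChIs go to the InChI field,
    # anything else is kept as a structure Cargo.
    return "InChI field" if is_standard_inchi(inchi) else "structure Cargo"


# Acetic acid as a standard InChI and as a non-standard (version 1) InChI.
print(inchi_destination("InChI=1S/C2H4O2/c1-2(3)4/h1H3,(H,3,4)"))
print(inchi_destination("InChI=1/C2H4O2/c1-2(3)4/h1H3,(H,3,4)"))
```

The check only looks at the prefix; it does not validate the rest of the identifier.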

Properties information

  • ID: The property ID must be as close as possible to the ID used in the article (e.g. logBCF, BCF_class).
  • Name: The property Name should give a short description of the property [e.g. Experimental logarithmic BCF, Experimental BCF class (nB - non-bio-accumulative, B - bioaccumulative)].
  • Endpoint: The Endpoint is the experimental test classification, i.e. the physico-chemical, biological, or environmental effect that has been measured (e.g. 2. Environmental fate parameters 2.4. Bioconcentration).
  • Species: Species is the name of the species according to binomial nomenclature. This attribute is only applicable to Properties that represent biological activities [e.g. Cyprinus carpio (common carp)].
  • UCUM: The Unified Code for Units of Measure (UCUM) field contains the unit of the experimental property. If the property is dimensionless, UCUM can be omitted. Documentation for UCUM can be found at http://unitsofmeasure.org/ucum.html
  • BibTeX: BibTeX contains the source from which the experimental measurements originate.

Descriptors information

  • ID: The descriptor’s ID should ideally match the internal descriptor ID of the descriptor calculation software.
  • Name: The Name should give a short description of the descriptor, preferably the same as in the documentation of the descriptor calculation software.
  • Application: Application should refer to the software used and its version (e.g. PaDEL-Descriptor 2.18).
  • UCUM: If the descriptor has a unit, it must be represented in UCUM http://unitsofmeasure.org/ucum.html

Models information

  • ID: The model’s ID must give enough information to precisely locate the model in the article (e.g. Tab1.Model1, Eq8).
  • Name: The model’s Name gives a short description of the model. It is recommended not to include the property’s name in the model’s name, to avoid repetition (e.g. Melting point model). If a certain group of chemicals was modelled, the name of the chemical class adds extra information for the model user (e.g. Model for hydrocarbons). Another example of a model’s name is ‘Imbalanced model towards B-compounds’.
  • Property: Property links the model to the modelled endpoint.
  • PMML: Predictive Model Markup Language (PMML) is an XML-based data format for the representation of statistical and data mining models. Documentation and examples of use can be found at http://www.dmg.org/v4-1/GeneralStructure.html
  • BibTeX: BibTeX contains the source from which the model originates.
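To make the PMML field more concrete, the sketch below embeds a small PMML 4.1 linear regression document and evaluates it with Python's standard XML parser. The descriptor name `x1`, the property name `y`, and the coefficients are invented for illustration and do not come from any deposited archive; the function `evaluate_linear_pmml` is likewise hypothetical and handles only plain numeric predictors with the default exponent of 1.

```python
import xml.etree.ElementTree as ET

# Hypothetical PMML 4.1 document: y = 1.5 + 2.0 * x1
PMML_TEXT = """<PMML version="4.1" xmlns="http://www.dmg.org/PMML-4_1">
  <Header description="Illustrative linear model"/>
  <DataDictionary numberOfFields="2">
    <DataField name="x1" optype="continuous" dataType="double"/>
    <DataField name="y" optype="continuous" dataType="double"/>
  </DataDictionary>
  <RegressionModel functionName="regression">
    <MiningSchema>
      <MiningField name="x1"/>
      <MiningField name="y" usageType="predicted"/>
    </MiningSchema>
    <RegressionTable intercept="1.5">
      <NumericPredictor name="x1" coefficient="2.0"/>
    </RegressionTable>
  </RegressionModel>
</PMML>
"""

NS = {"pmml": "http://www.dmg.org/PMML-4_1"}


def evaluate_linear_pmml(pmml_text, descriptor_values):
    """Evaluate the first RegressionTable of a linear PMML regression model."""
    root = ET.fromstring(pmml_text)
    table = root.find("pmml:RegressionModel/pmml:RegressionTable", NS)
    result = float(table.get("intercept"))
    # Add coefficient * descriptor value for every numeric predictor.
    for predictor in table.findall("pmml:NumericPredictor", NS):
        name = predictor.get("name")
        result += float(predictor.get("coefficient")) * descriptor_values[name]
    return result


print(evaluate_linear_pmml(PMML_TEXT, {"x1": 2.0}))
```

In a QDB archive, the field names used inside the PMML would presumably correspond to the descriptor and property IDs described above; a full PMML engine is needed for the other model types listed earlier.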

Predictions information

  • ID: The prediction’s ID should indicate the model and data set used (e.g. M1.train, M2.valid).
  • Name: The Name should give a short description of the data used for the predictions (e.g. Training set, Validation set).
  • Model: Model links the predictions to the model.
  • Type: Type can be “training”, “validation” or “testing”.
    • training: predictions for a data set used for model development
    • validation: model benchmarking and making predictions on known chemical systems
    • testing: making predictions on unknown chemical systems
  • Application: Application refers to the software, and its version, used for modelling (e.g. Random Forest 4.6-7).
  • UCUM: If the property has a unit, it must be represented in UCUM http://unitsofmeasure.org/ucum.html
  • BibTeX: BibTeX contains the source from which the predictions originate.

API for QsarDB


Predictor service

The Predictor service performs a single prediction with a (Q)SAR model deposited in the QsarDB repository.

Resource URL:

http://qsardb.org/repository/service/predictor/{handle}/models/{model}?{structure}

Parameters

Name       Parameter description
handle     QsarDB repository handle for a deposited QDB archive
model      model identifier in the QDB archive
structure  chemical structure in SMILES or InChI representation

Example request

http://qsardb.org/repository/service/predictor/10967/104/models/rf?CC(=O)O
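A request URL of this form can be assembled programmatically. The sketch below is a rough illustration only: the helper names `predictor_url` and `fetch_prediction` are hypothetical, the characters `(`, `)` and `=` are left unescaped so that the documented example is reproduced exactly, and percent-encoding of any other special character (such as `#` in a triple bond) is an assumption about how the service decodes the query string. The response format is not described here, so the body is returned as raw text.

```python
from urllib.parse import quote
from urllib.request import urlopen

BASE_URL = "http://qsardb.org/repository/service/predictor"


def predictor_url(handle, model, structure):
    """Build the resource URL for one prediction."""
    # Assumption: "(", ")" and "=" may stay literal, everything else outside
    # the unreserved set is percent-encoded.
    encoded_structure = quote(structure, safe="()=")
    encoded_model = quote(model, safe="")
    return f"{BASE_URL}/{handle}/models/{encoded_model}?{encoded_structure}"


def fetch_prediction(handle, model, structure, timeout=30):
    """Send the request and return the response body as text (format not specified)."""
    with urlopen(predictor_url(handle, model, structure), timeout=timeout) as response:
        return response.read().decode("utf-8")


# Reproduces the example request for acetic acid without contacting the server.
print(predictor_url("10967/104", "rf", "CC(=O)O"))
```

Calling `fetch_prediction("10967/104", "rf", "CC(=O)O")` would send the actual request, provided the service is reachable.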

Notes

Currently this functionality is supported only for a limited set of models that use Chemistry Development Kit descriptors:

Handle     Model ID  QDB archive
10967/104  rf        Lang, Andrew SID; Bradley, Jean-Claude ONS Melting Point Model 010.
10967/103  rf        Lang, Andrew SID; Bradley, Jean-Claude Abraham descriptor L.
10967/102  rf        Lang, Andrew SID; Bradley, Jean-Claude Abraham descriptor V.
10967/101  rf        Lang, Andrew SID; Bradley, Jean-Claude Abraham descriptor B.
10967/100  rf        Lang, Andrew SID; Bradley, Jean-Claude Abraham descriptor A.
10967/99   rf        Lang, Andrew SID; Bradley, Jean-Claude Abraham descriptor S.
10967/98   rf        Lang, Andrew SID; Bradley, Jean-Claude Abraham descriptor E.
10967/97   rf        Lang, Andrew SID; Bradley, Jean-Claude Abraham model solvent coefficient v.
10967/96   rf        Lang, Andrew SID; Bradley, Jean-Claude Abraham model solvent coefficient b.
10967/95   rf        Lang, Andrew SID; Bradley, Jean-Claude Abraham model solvent coefficient a.
10967/94   rf        Lang, Andrew SID; Bradley, Jean-Claude Abraham model solvent coefficient s.
10967/93   rf        Lang, Andrew SID; Bradley, Jean-Claude Abraham model solvent coefficient e.
10967/6    rf        Lang, Andrew SID; Bradley, Jean-Claude Abraham model solvent coefficient c.