Recommended Fields for PSDI Resources

PSDI partners are free to define fields (which are sometimes also referred to as "variables" or "properties") in their datasets as they choose, according to their domain and the community best-practices for their technique. When defining these fields, the recommendations below should also be considered:

  1. General guidelines for maximising compatibility with PSDI Cross Data Search
  2. Best-practices for capturing key data types

PSDI Cross Data Search uses OPTIMADE to search across commonly defined fields in multiple resources in a consistent way. It is more straightforward to make a dataset searchable using PSDI Cross Data Search if its fields can be aligned with those described below.

Recommendations to maximise PSDI Cross Data Search compatibility:

Data resources can then feed into PSDI Cross Data Search either by:

  • Becoming an OPTIMADE database provider and realising an OPTIMADE API to serve data from one or more databases
  • Indexing data via the backend PSDI indexing service (via an alternative API)
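
As a rough illustration of what a client sends to an OPTIMADE provider, the sketch below builds a query URL for the standard `structures` endpoint. The base URL and the function name `build_structures_query` are hypothetical placeholders; `filter`, `response_fields` and `page_limit` are query parameters from the OPTIMADE specification, and `elements`, `nelements` and `chemical_formula_reduced` are standard structure properties. It only constructs the URL and does not contact any service.

```python
from urllib.parse import urlencode

# Hypothetical provider base URL (an assumption); replace with a real OPTIMADE endpoint.
BASE_URL = "https://example.org/optimade/v1"


def build_structures_query(elements, max_results=10):
    """Return a URL asking for structures that contain all of the given elements."""
    # OPTIMADE filter strings quote each element symbol, e.g. elements HAS ALL "C", "H"
    quoted = ", ".join(f'"{symbol}"' for symbol in elements)
    params = {
        "filter": f"elements HAS ALL {quoted}",
        "response_fields": "chemical_formula_reduced,nelements",
        "page_limit": max_results,
    }
    # urlencode percent-encodes the quotes and commas and turns spaces into "+"
    return f"{BASE_URL}/structures?{urlencode(params)}"


url = build_structures_query(["C", "H", "O"])
print(url)
```

Aligning dataset fields with the commonly defined OPTIMADE properties means the same filter string can be sent unchanged to every participating resource.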

Best practices for describing common chemistry entities

The PSDI Data Conversion service can be used to convert between some of these formats to generate more representations of the same molecule, and various cheminformatics toolkits, e.g. RDKit, can be used to do likewise.

Molecules

There are many descriptors and representations of molecules and it is best practice to store as many as possible since they all have their strengths and weaknesses:

  • InChI, InChIKey and AuxInfo - please use the standard InChI and InChIKey
  • SMILES - please ensure that the canonical SMILES is used
  • Mol file - this is less easy to search on than the more compact identifiers above but can give a less ambiguous definition of tautomers
  • ChemDraw file (.cdx or preferably .cdxml) - if this is the original representation drawn by a chemistry researcher, it is useful to keep in case errors arise when converting to other molecule descriptors. Where possible, save in .cdxml format, which is easier to interpret later than the binary .cdx file.
  • Chemical database identifiers - URLs that link to records about the molecules in databases. We recommend linking to ChEBI representations of molecules, since ChEBI is highly curated and contains ontological information.
  • IUPAC name

Descriptors which overlap with molecule properties defined in the cheminformatics namespace should follow those definitions where possible.
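
The sketch below shows one way a record could hold several of the representations listed above side by side, with a light format check on the InChI and InChIKey. The class name `MoleculeRecord` and its field names are hypothetical and are not taken from the cheminformatics namespace. The check relies only on the published string formats: a standard InChI starts with `InChI=1S/`, and a standard version 1 InChIKey has a second block ending in `SA`. It does not verify that the identifiers describe the same molecule.

```python
import re
from dataclasses import dataclass
from typing import Optional

STANDARD_INCHI_PREFIX = "InChI=1S/"
# 14 letters, then 8 letters + "S" (standard) + "A" (version 1), then 1 letter
STANDARD_INCHIKEY = re.compile(r"^[A-Z]{14}-[A-Z]{8}SA-[A-Z]$")


@dataclass
class MoleculeRecord:
    """Hypothetical container for several representations of one molecule."""
    inchi: str
    inchikey: str
    canonical_smiles: str
    iupac_name: Optional[str] = None
    chebi_url: Optional[str] = None
    molfile: Optional[str] = None

    def problems(self):
        """Return a list of format problems; an empty list means none were found."""
        issues = []
        if not self.inchi.startswith(STANDARD_INCHI_PREFIX):
            issues.append("InChI is not a standard InChI")
        if not STANDARD_INCHIKEY.match(self.inchikey):
            issues.append("InChIKey is not a standard InChIKey")
        return issues


ethanol = MoleculeRecord(
    inchi="InChI=1S/C2H6O/c1-2-3/h3H,2H2,1H3",
    inchikey="LFQSCWFLJHTTHZ-UHFFFAOYSA-N",
    canonical_smiles="CCO",
    iupac_name="ethanol",
    chebi_url="http://purl.obolibrary.org/obo/CHEBI_16236",
)
print(ethanol.problems())
```

Generating the identifiers themselves is better left to a toolkit or to the PSDI Data Conversion service; a check like this is only a guard against storing non-standard strings by mistake.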

Reactions

In the same way, when referring to reactions, it is good practice to store metadata that describes the reaction in terms of its reactant and product molecules in a variety of formats:

  • Reaction SMILES (which is basically the SMILES representations of the reactants, any agents and the products, joined with ">" in the form reactants>agents>products)
  • RInChI and RInChIKey (which are basically the InChI and InChIKey representations, respectively, of the reactants and products concatenated in a particular way)
  • RXN file (which is basically the MOL representation of the reactants and products concatenated in a particular way)
  • ChemDraw file (.cdx or preferably .cdxml) - often this is the original representation drawn by a chemistry researcher, and as such it is useful to keep in case errors arise when converting to other descriptors. Where possible, save in .cdxml format, which is easier to interpret later than the binary .cdx file.
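
To make the structure of a reaction SMILES concrete, the sketch below splits one into its component molecule SMILES. It assumes the plain `reactants>agents>products` layout, with `.` separating molecules within each part, and it does not handle extensions appended after the reaction string. The function name `split_reaction_smiles` is illustrative only; a cheminformatics toolkit should be used for real parsing and validation.

```python
def split_reaction_smiles(reaction_smiles):
    """Split 'reactants>agents>products' into lists of molecule SMILES."""
    parts = reaction_smiles.split(">")
    if len(parts) != 3:
        raise ValueError("expected exactly two '>' separators")
    # Within each part, '.' separates individual molecules; drop empty strings
    reactants, agents, products = (
        [smiles for smiles in part.split(".") if smiles] for part in parts
    )
    return {"reactants": reactants, "agents": agents, "products": products}


# Esterification of acetic acid with ethanol, written without agents
parsed = split_reaction_smiles("CC(=O)O.OCC>>CC(=O)OCC.O")
print(parsed)
```

Storing the split components alongside the full reaction string makes it possible to reuse the molecule-level descriptors for each reactant and product.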