Choosing a Repository for Your Research Data

During your research you are likely to need to deposit your research data in a repository or database of some kind. This may be because you need a place to store or archive the data, or because you want to share it with collaborators or the wider scientific community when you publish the research. Depositing data in a repository supports each of the FAIR Data Principles:

  • Findable: Repositories ensure data is discoverable by assigning unique identifiers (e.g., DOIs) and providing rich metadata that is indexed in searchable resources.
  • Accessible: They make data and metadata available through standardized protocols, often with clear access conditions, ensuring both humans and machines can retrieve the data.
  • Interoperable: Repositories use standardized formats, vocabularies, and metadata to enable data integration and compatibility with other datasets and tools.
  • Reusable: They provide detailed documentation, provenance information, and clear licensing to facilitate data reuse and replication in future research.
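As a rough illustration of the "rich metadata" and "clear licensing" points above, the sketch below checks a metadata record for empty fields before deposit. The field names and the example record are hypothetical and do not follow any particular repository's schema; a real repository will publish its own required fields.

```python
# Hypothetical list of fields to check; not taken from any real schema.
REQUIRED_FIELDS = ("identifier", "title", "creators", "licence", "description")


def missing_fields(record: dict) -> list:
    """Return the required fields that are absent or empty in the record."""
    return [field for field in REQUIRED_FIELDS if not record.get(field)]


# Made-up example record; the DOI below is a placeholder, not a real dataset.
record = {
    "identifier": "10.1234/example-dataset",
    "title": "Example simulation trajectories",
    "creators": ["A. Researcher"],
    "licence": "",  # left empty on purpose
    "description": "Trajectory files with input parameters.",
}

print(missing_fields(record))
```

A check like this only confirms that fields are filled in; it says nothing about whether the descriptions are good enough for someone else to reuse the data.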

Repositories often provide ways to restrict access to deposited research data, and many also allow data to be placed under embargo, so using a repository does not automatically make the data open.

Types of repositories

Repositories for scientific research data can be categorized into several types based on their focus and scope:

  • Discipline-specific repositories
  • Generalist or generic repositories that accept data from any discipline (e.g., Zenodo, Figshare)
  • Institutional repositories, managed by universities or research institutions
  • Government repositories, usually for publicly funded research data (e.g., UK Data Service)
  • Publisher-linked repositories, used by academic journals to store data linked to published articles

Considerations for choosing a repository

In certain situations, such as those dictated by your project, funder, institution, or publication, the choice of repository may be predetermined. However, when you do have the opportunity to choose, it is important to select the repository that best suits your data for several important reasons, including:

  • Ensuring the long-term access and preservation of your data
  • Maximising discoverability, enabling others to find, reuse, and cite your data
  • Complying with funder and publisher requirements and standards
  • Adding credibility and transparency to your research
  • Improving reusability by supporting the FAIR principles
  • Providing appropriate access control for data that cannot be publicly shared
  • Supporting discipline-specific standards or aggregation with related data
  • Cost and sustainability considerations
  • Technical functionality, for example support for large datasets, version control, tools for working with the data, and back-ups
  • Licensing and legal compliance considerations
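One informal way to compare candidates against the considerations above is a weighted checklist. The sketch below is only an illustration: the criteria names, the weights, and the two candidate repositories are all made up, and you would set them according to your own project, funder, and data.

```python
# Hypothetical criteria and weights; adjust to your own priorities.
WEIGHTS = {
    "preservation": 3,
    "persistent_identifier": 3,
    "access_control": 2,
    "discipline_fit": 2,
    "no_deposit_cost": 1,
}


def score(features: dict) -> int:
    """Sum the weights of the criteria that a repository satisfies."""
    return sum(weight for name, weight in WEIGHTS.items() if features.get(name))


# Made-up candidates for illustration only.
candidates = {
    "Repository A": {"preservation": True, "persistent_identifier": True, "discipline_fit": True},
    "Repository B": {"preservation": True, "access_control": True, "no_deposit_cost": True},
}

# Rank candidates from highest to lowest score.
ranked = sorted(candidates, key=lambda name: score(candidates[name]), reverse=True)
for name in ranked:
    print(name, score(candidates[name]))
```

A score like this is a prompt for discussion rather than a decision: a single hard requirement, such as a funder mandate or the need to restrict access, can outweigh everything else.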

TRUST Principles

The TRUST principles for digital repositories are a framework designed to guide the selection of trustworthy digital repositories for research data:

  • Transparency: Repositories should clearly communicate their policies, procedures, and governance to build trust with users. These should include information about data deposition, data preservation, discovery, terms of use and whether additional functions are provided such as capabilities for managing sensitive data.
  • Responsibility: They must demonstrate accountability in managing and preserving data including adherence to appropriate standards, provision of data services, and managing and protecting the data.
  • User Focus: Repositories should prioritize the needs of their user communities, ensuring accessibility and usability. To make data discoverable by others, repositories should encourage users to fully describe their data at the time of deposition and should enforce community standards.
  • Sustainability: Long-term preservation and uninterrupted access to data should be supported through reliable funding, governance, and infrastructure.
  • Technology: Robust and secure technological systems should be in place to maintain data integrity and accessibility, and to protect against potential threats.

Some certification programs exist for repositories to demonstrate that they meet standards of trustworthiness and reliability. Some widely recognized certifications include:

  • CoreTrustSeal (CTS): An international, community-based certification that evaluates repositories based on criteria like data integrity, accessibility, and long-term preservation.
  • ISO 16363: An international standard for auditing and certifying trustworthy digital repositories.
  • Nestor Seal for Trustworthy Digital Archives: A certification based on a German standard (DIN 31644).

Choosing a discipline-specific repository

Most funders and publishers recommend that, where possible, you deposit your data in a discipline-specific, community-recognised repository. Check the advice given by your funder, publisher, or institutional librarian to see whether there is specific guidance for your discipline or data type. Ensure that the selected repository meets your requirements for the FAIR principles, for example by providing a DOI for your data and access to the data in a standard format.
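When recording the DOI a repository assigns to your dataset, a quick syntax check can catch copy-and-paste errors. The sketch below uses a commonly used heuristic pattern (prefix `10.`, a numeric registrant code, a slash, then a suffix). It is an assumption-based check, not the full DOI syntax, and it does not confirm that the DOI actually resolves. The function name and example values are illustrative.

```python
import re

# Heuristic pattern: "10." + 4-9 digit registrant code + "/" + non-empty suffix.
DOI_PATTERN = re.compile(r"^10\.\d{4,9}/\S+$")


def looks_like_doi(value: str) -> bool:
    """Return True if the string has the general shape of a DOI."""
    value = value.strip()
    # Accept the common "https://doi.org/" and "doi:" prefixes.
    for prefix in ("https://doi.org/", "doi:"):
        if value.lower().startswith(prefix):
            value = value[len(prefix):]
            break
    return bool(DOI_PATTERN.match(value))


print(looks_like_doi("https://doi.org/10.5281/zenodo.1234567"))
print(looks_like_doi("zenodo.1234567"))
```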

Browse the subject headings below to see the repositories available for each subject or data type:

PSDI-affiliated

Repository | Data Type | Description
Biomolecular Simulations Database (BioSimDB) | Datasets in any file format up to 100GB | The biomolecular simulation database (BioSimDB) is a free repository of trajectory files produced from molecular dynamics (MD) simulations of biomolecules.
Collaborative Computational Project for NMR Crystallography (CCP-NC) Magres Database | Magres format | The Magres database is a repository of first-principles computational results for solid-state NMR crystallography, stored in the Magres (.magres) format. This resource provides a central platform for researchers to share, explore, and utilise data associated with calculations of NMR parameters in solid-state structures.
Data to Knowledge | Data collection in any file format up to 100GB | The Data to Knowledge Community Collection is a repository for data used or generated by machine learning models in modelling materials and molecular systems. This currently includes simulation data, training data, and the models themselves.

Computational simulations

Repository | Data Type | Description
ioChem-BD | Computational chemistry files | Data stored as Chemical Markup Language (CML) (XML-CML). Data can be worked on in a private area before publication.
Materials Cloud | Computational materials science | Materials Cloud is built to enable the seamless sharing and dissemination of resources in computational materials science, offering educational, research, and archiving tools; simulation software and services; and curated and raw data. You can browse, explore, download, or deposit raw and curated data.
NOMAD | Materials simulation data including electronic structure and molecular dynamics. | Upload, manage, and search raw materials science data; supports most community codes and file formats. Materials data can be searched and downloaded in raw and processed forms.
Protein Data Bank (PDB) | Coordinate files (PDBx/mmCIF, PDB, XML) | Repository for experimentally-determined 3D structures of large biological molecules.

Environmental

Repository | Data Type | Description
Centre for Environmental Data Analysis (CEDA) Archive | Atmospheric and earth observation research and environmental data. Any format, but data must be well formatted and accompanied by appropriate documentation | A CoreTrustSeal-certified repository of atmospheric and earth observation data from climate models, satellites, aircraft, met observations, and other sources.
Environmental Data Initiative (EDI) | Data from the ecological and environmental sciences, including very large datasets | A CoreTrustSeal-certified repository helping the scientific community curate and preserve all scales of environmental and ecological data.
EarthChem | Geochemical, geochronological, and petrological data in any format, but data must be adequately documented | EarthChem provides open data services to the geochemical, petrological, mineralogical, and related communities. Services include data preservation, discovery, access, and visualization. EarthChem adheres to the FAIR, TRUST and CARE principles.
World Data Center for Climate (WDCC) | Earth System Model data, including climate data and models. Only open source data formats are accepted. Network Common Data Form (NetCDF) is preferred, but GRIdded Binary (GRIB), CSV, ASCII, and Zarr are also accepted. | WDCC is a CoreTrustSeal-certified repository. It is a long-term archiving service, primarily for DKRZ HPC (high performance computing) project data, but it also accepts data from external sources.

Images

Repository | Data Type | Description
Coherent X-ray Imaging Data Bank | Data from Coherent X-ray Imaging (CXI) experiments in CXI format | CXIDB is dedicated to making data from Coherent X-ray Imaging (CXI) experiments available to all, as well as archiving it. The website also serves as the reference for the CXI file format, in which most of the experimental data in the database is stored.
Electron Microscopy Public Image Archive (EMPIAR) | MRC, MRCS, TIFF, DM4, IMAGIC, SPIDER, MRC FEI, RAW FEI and BIG DATA VIEWER HDF5 | EMPIAR stores raw images from cryo-electron microscopy (cryo-EM) and 3D datasets from volume EM techniques and soft and hard X-ray tomography. Individual EMPIAR entries are assigned a DOI, and the materials can be reused without any conditions or restrictions.
Electron Microscopy Data Bank (EMDB) | 3D map volumes in CCP4 map format | A public repository for cryogenic-sample Electron Microscopy (cryoEM) volumes and representative tomograms of macromolecular complexes and subcellular structures.
Image Data Resource | Supports all formats available in the Bio-Formats library (https://bio-formats.readthedocs.io/en/stable/supported-formats.html), but open formats such as OME-TIFF are preferred. Very large datasets are accepted. | The Image Data Resource (IDR) is a public repository of reference image datasets from published scientific studies, including bioimages and multidimensional life sciences image data (cell and tissue).

Life sciences

Repository | Data Type | Description
FlowRepository | Flow cytometry data in MIFlowCyt standard format | FlowRepository is a database of flow cytometry experiments.
Protein Circular Dichroism Data Bank (PCDDB) | Circular dichroism (CD) and synchrotron radiation CD (SRCD) spectral data | The Protein Circular Dichroism Data Bank (PCDDB) is a public repository that archives and freely distributes circular dichroism and synchrotron radiation spectral data and their associated experimental metadata. The entries are linked, when appropriate, to primary sequence (UniProt) and structural (PDB) databases, as well as to secondary databases such as the Enzyme Commission functional classification database and the CATH fold classification database.
Standards for Reporting Enzymology Data (STRENDA DB) | Functional enzymology data (kinetic and experimental data) | STRENDA provides extensive guidelines on how to prepare functional enzymology data and includes tools that enable researchers to check that datasets and documentation are complete, valid, and compliant with the STRENDA Guidelines.

A list of additional life science data repositories and databases that may also be relevant to researchers in the physical sciences is available from the Services: Data Resources page on the ELIXIR website.

Materials

Repository | Data Type | Description
Materials Cloud | All formats are supported, but data is expected to be of value and able to be reused by others in the field. Large datasets are supported. | Materials Cloud enables sharing and dissemination of resources in computational materials science, and offers educational, research, and archiving tools; simulation software and services; and curated and raw data. The Materials Cloud Archive is an open-access, moderated repository for research data in computational materials science, with particular focus on sharing the full provenance of calculations.
Materials Data Facility | Various materials data types including very large datasets | Materials Data Facility is a platform designed to help researchers publish, discover, and access materials science datasets. Deposited datasets are provided with a DOI and tools are provided for data aggregation and machine learning.
Novel Materials Discovery (NOMAD) | Materials simulation data including electronic structure and molecular dynamics data. | NOMAD is a web-based platform to organize, analyze, share, and publish materials science data. NOMAD supports most community codes and formats, including atomistic codes, workflow managers, and database managers. Data is stored using NOMAD's standard common data format.
MPContribs | Computational and experimental materials data | MPContribs provides a platform and application programming interface (API) to contribute computational as well as experimental data to Materials Project.

Omics and sequence data

Repository | Data Type | Description
ArrayExpress | High-throughput functional genomics data in a large number of microarray and HTS formats | The functional genomics data collection (ArrayExpress) stores data from high-throughput functional genomics experiments, including metadata such as detailed sample annotations, protocols, processed data and raw data. Raw sequence reads from high-throughput sequencing studies are brokered to the European Nucleotide Archive (ENA).
Database of Genotypes and Phenotypes (dbGaP) | Multiple, including: raw (TIFF), initial sequence reads (TXT, CEL, FASTQ), and analysed data (TXT, Sequence Alignment Map (SAM), Binary Alignment Map (BAM), BED, WIG, VCF, MAF, PED). | The database of Genotypes and Phenotypes (dbGaP) was developed to archive and distribute the data and results from studies that have investigated the interaction of genotype and phenotype in humans.
dbVar | Structural variants (>50 bp) that do not contain personal or sensitive clinical information. | dbVar is NCBI's database of human genomic structural variation: large variants >50 bp including insertions, deletions, duplications, inversions, mobile elements, translocations, and complex variants.
European Variation Archive | EVA accepts all types of precise genetic variants, in any species, in valid VCF format | The European Variation Archive is an open-access database of all types of genetic variation data from all species.
Gene Expression Omnibus | Array and sequence-based data are accepted. | GEO is a public functional genomics data repository supporting MIAME-compliant data submissions.
Genome Sequence Archive | FASTQ, BAM, HDF5, Reference FASTA, SFF, and SRF recommended | The Genome Sequence Archive (GSA) is a data repository for genome, transcriptome and other omics raw sequencing data.
International Nucleotide Sequence Database Collaboration (INSDC) | Various | The International Nucleotide Sequence Database Collaboration (INSDC) archives nucleotide sequence data, from raw to assembled and annotated sequences, from around the world. Partners include DNA Data Bank of Japan (DDBJ), European Nucleotide Archive (ENA), and GenBank.
IntAct molecular interaction database (IntAct) | Molecular interaction data; XML files following the PSI-MI standard are recommended. | IntAct provides an open source database and analysis tools for molecular interaction data. Unique accession numbers are supplied for submissions and datasets of any size are supported.
MetaboLights | Various raw spectral files accompanied by open source derived files such as mzML and nmrML | MetaboLights is a database for metabolomics experiments and derived information. The database is cross-species, cross-technique and covers metabolite structures and their reference spectra as well as their biological roles, locations and concentrations, and experimental data from metabolic experiments.
Metabolomics Workbench | Metabolomics data for small and large studies on cells, tissues and organisms. MS, NMR and other analyses are accepted. | The Metabolomics Workbench serves as an international repository for metabolomics data and metadata and provides analysis tools, metabolite standards, and protocols.
MGnify | Various | MGnify is a free to use resource for analysis, visualisation and discovery of metagenomic, metatranscriptomic, amplicon and assembly datasets.
miRBase: the microRNA database | Various | miRBase is a searchable database of published miRNA sequences and annotations, and a naming service.
ProteomeXchange | MS/MS proteomics and SRM data. Other proteomics data may be possible. | The ProteomeXchange Consortium provides globally coordinated standard data submission and dissemination pipelines involving the main proteomics repositories. ProteomeXchange is a Global Core Biodata Resource. Member repositories include the PRIDE repository for proteomics mass spectrometry data.
Sequence Read Archive (SRA) | Genetic data and the associated quality scores in BAM, CRAM, SFF, HDF5, FASTQ, FASTA, and CSFASTA | SRA is the largest publicly available repository of high throughput sequencing data. The archive accepts data from all branches of life as well as metagenomic and environmental surveys. SRA stores raw sequencing data and alignment information to enhance reproducibility and facilitate new discoveries through data analysis.
Universal Protein Resource (UniProt) | UniProtKB/Swiss-Prot | UniProt is a comprehensive resource for protein sequence and annotation data.

A list of additional life science data repositories and databases that may also be relevant to researchers in the physical sciences is available from the Services: Data Resources page on the ELIXIR website.

Spectroscopy

Repository | Data Type | Description
Biological Magnetic Resonance Data Bank (BMRB) | BMRB accepts NMR spectral parameters, relaxation data, other kinetic data, and thermodynamic data | BMRB collects, annotates, archives, and disseminates spectral and quantitative data derived from NMR spectroscopic investigations of biological macromolecules and metabolites.
Chemotion | Mass spectrometry, NMR, IR and Raman, XRD, UV-VIS, and cyclic voltammetry. mzML, mzXML, JCAMP-DX and various vendor formats | Chemotion Repository is a field-specific repository for molecular and synthetic chemistry and covers research data that is assigned to molecules, their properties and characterisation as well as reactions and experimental investigations.
MassBank | All kinds of mass spectral data are accepted in MassBank Record Format | MassBank is an open source mass spectral library for the identification of small chemical molecules of metabolomics, exposomics and environmental relevance.
nmrshiftdb2 | JCAMP-DX, PDF, raw data or image | nmrshiftdb2 is an NMR database (web database) for organic structures and their nuclear magnetic resonance (NMR) spectra. It allows for spectrum prediction (13C, 1H and other nuclei) as well as for searching spectra, structures and other properties. The nmrshiftdb2 software is open source, and the data is published under an open content license. Submissions can be kept private and published later.
NMR Online Management And Datastore (NOMAD-NMR) | Various | NOMAD-NMR is an open source tool that can be used to manage the automatic upload and storage of NMR datasets. NOMAD-NMR aims to develop a decentralized peer-to-peer repository for NMR data.
nmrXiv | All major NMR formats are supported | nmrXiv is a FAIR and Open, Consensus-Driven Nuclear Magnetic Resonance (NMR) Data Repository and Computational platform.
Open Spectral Database (OSDB) | Various | OSDB is an open source database for scientists to share spectral data and export it in various formats to support open science. Each spectrum is assigned a persistent identifier.

Software, code and models

Repository | Data Type | Description
BioModels | SBML and CellML recommended | BioModels is a repository of mathematical models of biological and biomedical systems aimed at systems biology and pharmacology researchers. An identifier is provided for submissions.
Code Ocean | Various | Code Ocean is a cloud-based computational reproducibility platform that allows researchers to easily share, discover, and run code, data, and environments associated with their research, ensuring reproducibility and traceability.
GitHub | GitHub accepts files and also large attachments in a variety of formats | GitHub is a web-based platform that uses Git, a version control system, to allow developers to store, manage, and share code, facilitating collaboration and open-source software development. GitHub can be used in conjunction with Zenodo to create a DOI for your repository.
GitLab | Various | GitLab is a web-based platform for version control, with collaboration tools and workflow automation.

Structural

Repository | Data Type | Description
Biological Magnetic Resonance Data Bank (BMRB) | BMRB accepts NMR spectral parameters, relaxation data, other kinetic data, and thermodynamic data | BMRB collects, annotates, archives, and disseminates spectral and quantitative data derived from NMR spectroscopic investigations of biological macromolecules and metabolites.
Cambridge Structural Database (CSD) | Crystallographic information file (CIF) | The CSD is the world’s largest database of small-molecule organic and metal-organic crystal structure data. The database is CoreTrustSeal certified, and recognized as a trusted data repository.
Crystallography Open Database (COD) (https://www.crystallography.net/cod/) | CIF files | COD is an open-access collection of crystal structures of organic, inorganic, metal-organic compounds and minerals, excluding biopolymers.
Electron Microscopy Data Bank | MRC file | The Electron Microscopy Data Bank (EMDB) is a public repository for cryogenic-sample Electron Microscopy (cryoEM) volumes and representative tomograms of macromolecular complexes and subcellular structures.
Inorganic Crystal Structure Database (ICSD) | Organic, metal-organic and inorganic experimental crystal structures as well as theoretical structures | The Inorganic Crystal Structure Database (ICSD) is the world's largest database for completely identified inorganic crystal structures.
Integrated Resource for Reproducibility in Macromolecular Crystallography (IRRMC) | Data from diffraction experiments used to determine protein structures | IRRMC is an open, comprehensive repository for archiving raw data, including metadata, from macromolecular diffraction experiments.
International Centre for Diffraction Data (ICDD) | Powder Diffraction File (PDF) | ICDD is a non-profit scientific organization dedicated to collecting, editing, publishing, and distributing powder diffraction data for the identification of materials.
Protein Data Bank (PDB) | Biological crystal structures in PDBx/mmCIF format | The PDB provides access and tools for exploration, analysis, and visualisation of experimentally-determined 3D structures and computed structure models.

Supramolecular

Repository | Data Type | Description
Biological Magnetic Resonance Data Bank (BMRB) | BMRB accepts NMR spectral parameters, relaxation data, other kinetic data, and thermodynamic data | BMRB collects, annotates, archives, and disseminates spectral and quantitative data derived from NMR spectroscopic investigations of biological macromolecules and metabolites.
SUPRABANK | JSON (DataCite), CDX (for 2D/3D molecule structure), PNG, and various proprietary formats | SUPRABANK is a curated database of intermolecular interactions of molecular systems and supramolecular interactions for researchers in organic chemistry dealing with binding, assembly, and interaction phenomena.

Repository registries

If your research data does not fit into any of the above disciplinary repository options you can try to find an alternative by searching a registry of data repositories:

If you are still unable to find a suitable repository for your data, consider choosing a general chemistry repository.

Repository | Data Type | Description
Research Data Repository for Chemistry (RADAR4Chem) | Various | RADAR4Chem is a multidisciplinary repository for the publication of research data from all disciplines of chemistry.

Generalist repositories

It is not always possible to find an appropriate discipline-specific repository for your data, and therefore it may be necessary to use a generalist repository.

  • 4TU.ResearchData
  • Research Data Australia
  • Dryad
  • Figshare
  • Harvard Dataverse
  • Mendeley Data
  • Open Science Framework (OSF)
  • Science Data Bank

About this page

If you would like to contribute content to the PSDI Knowledge Base or have feedback you would like to give on this guidance, please contact us.