Creating a README File for Your Shared Data

The purpose of the README file is to provide a clear and concise description of the content and structure of the data included in your dataset, so that the data can easily be understood and used by others. The kinds of information that should be in a README file are described in the following sections.

Data provenance

Data provenance describes information about where the data originated, how it was produced, and what transformations have occurred since it was created:

  • Who created the data? For example, who collected or created the data, who has analysed it, who has transformed the data, who created the data package, and who deposited the data.
  • ORCID IDs for contributors to the research and data package
  • Information about the research project: the institutions involved, funding sources, the research project title, and why the data was collected.
  • How was the data produced? For example, how it was captured, the instruments, methodologies, and techniques used, and the experimental conditions.
  • How has the data been transformed? What processing has the data undergone? Has it been converted to a different file type, either for analysis or to bring it into a standard format for deposition in a repository?

Data structure

Information should be provided in the README to describe how the files have been organised and named:

  • How are the files organised in the data package or data set, including folder or directory structures?
  • What is the naming scheme for files and directories?
  • What are the file formats included?
  • Provide a list of files (or directories / groups of files) with a brief description of what the files contain.
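The file list described above can be drafted automatically and then annotated by hand. The following is a minimal sketch, assuming the data package is an ordinary directory on disk; the function name `list_files` and the "TODO" description text are illustrative choices, not part of any standard.

```python
from pathlib import Path


def list_files(root: Path) -> list[str]:
    """Return one README line per file under root, sorted by relative path."""
    lines = []
    for path in sorted(p for p in root.rglob("*") if p.is_file()):
        # Use forward slashes so the listing looks the same on every platform.
        rel = path.relative_to(root).as_posix()
        size = path.stat().st_size
        # The description is a placeholder for the author to fill in by hand.
        lines.append(f"{rel} ({size} bytes): TODO describe contents")
    return lines
```

Calling `list_files(Path("my_dataset"))`, where `my_dataset` is a placeholder directory name, returns lines that can be pasted into the README and completed with a short description of each file.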

Version

The README should contain information about the version of the research data included:

  • What is the version of the dataset?
  • When was the data last updated?

Code and parameters

Information about the software or code that was used to produce, analyse, or transform the data should be included in the README:

  • What scripts, software, or other code were used to produce or manipulate the data?
  • What settings or parameters were used that would be needed to replicate or reuse the data?
  • Instructions on how to install and run any code or software resources provided.
  • Any known problems with the code or the data?
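One way to capture settings and software versions is to record them at the time the analysis runs and copy the result into the README. The sketch below assumes the analysis is done in Python; the function name `describe_run` and the example parameter name are hypothetical, and other tools will need their own equivalent.

```python
import json
import platform
import sys


def describe_run(parameters: dict) -> str:
    """Return a JSON string recording the interpreter, platform, and parameters."""
    record = {
        "python_version": platform.python_version(),
        "platform": sys.platform,
        # Whatever settings are needed to replicate the analysis.
        "parameters": parameters,
    }
    return json.dumps(record, indent=2, sort_keys=True)
```

For example, `describe_run({"threshold": 0.5})` produces a small block of text listing the Python version, the platform, and the (hypothetical) `threshold` setting, which can be included in the README alongside installation instructions.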

License conditions

Any usage and attribution conditions or restrictions should be included in the README, or specified in a separate license file:

  • If there is a separate license file, what is the name of the license file and where can it be found in the data package? If not:
  • What are the usage conditions of the data and how may the data be reused?
  • How can the data be cited and what attributions are required?
  • If the data was based on another primary dataset, how can that data be cited?

Additional metadata

The following are additional important items of metadata to include in the README file:

  • Title of the dataset
  • Dates that the data was collected
  • Keywords
  • Attributes of the dataset, for example column and row labels
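For tabular data, the column labels can be read directly from the file rather than retyped. This is a minimal sketch, assuming a UTF-8 CSV file whose first row is a header; the function name `column_labels` is illustrative.

```python
import csv
from pathlib import Path


def column_labels(csv_path: Path) -> list[str]:
    """Return the header row of a CSV file, or an empty list if the file is empty."""
    # newline="" is the setting the csv module documentation recommends for files.
    with csv_path.open(newline="", encoding="utf-8") as handle:
        reader = csv.reader(handle)
        return next(reader, [])
```

The returned labels still need a human-written explanation in the README, for example the meaning and units of each column.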

Making the README file machine readable

To make the README machine-readable, avoid using proprietary formats such as PDF, Word documents, and Rich Text Format. Instead, use plain text (‘.txt’) or a lightweight plain-text markup format such as Markdown (‘.md’). Incorporate metadata in a structured format. For example, you can use JSON or YAML to record essential information such as the title, description, author, and license.
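As an illustration of structured metadata, the sketch below builds a JSON block containing the title, description, authors, and license. The field names and the function name `build_metadata` are assumptions made for this example and do not follow any particular metadata schema; a repository may require its own field names.

```python
import json


def build_metadata(title: str, description: str, authors: list[str], license_id: str) -> str:
    """Serialise core README metadata as a JSON string."""
    # Refuse to produce metadata that is missing its essential fields.
    for name, value in (("title", title), ("license", license_id)):
        if not value.strip():
            raise ValueError(f"{name} must not be empty")
    if not authors:
        raise ValueError("at least one author is required")
    metadata = {
        "title": title,
        "description": description,
        "authors": authors,
        "license": license_id,
    }
    # ensure_ascii=False keeps accented names readable in the output.
    return json.dumps(metadata, indent=2, ensure_ascii=False)
```

The resulting text can be saved alongside the README or embedded in it, so that both people and software can read the same information.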

What to do next

Related links: Cornell Data Services provide useful resources to help you write a README file for your research data:


About this page

If you would like to contribute content to the PSDI Knowledge Base or have feedback you would like to give on this guidance, please contact us.