Sharing data

Sharing data at the end of a project makes it available for reuse by others. This is increasingly encouraged by both funders and the research community more generally: it is very rare for the full potential of a research dataset to be realised in one project, and sharing helps maximise the value of the data.

While not all data is suitable for sharing, the general trend is towards openness as the default, with restrictions only as necessitated by specific legal, ethical, or commercial considerations.

Why share research data?

Benefits to data creators

  • Wider dissemination of research findings: there is evidence that journal articles for which the underlying data has been made available receive a boost to their citation count.
  • Increased credit and recognition: data is a research output itself which can be cited like a published paper, crediting the researcher who produced it. This in turn can be used to help demonstrate research impact, and allows researchers to get proper acknowledgement for this portion of the work.
  • Demonstration of research integrity: sharing the data which underpins research findings helps to clarify how that research has been conducted, and how conclusions were arrived at.

Benefits to the wider academic community

  • Reduced duplication of effort: if data is made available to others, considerable time, money and resources can be saved by not re-collecting data that already exists.
  • Speeding up research: pooling shared data may help researchers answer their research questions more rapidly. If this permits a larger body of data to be analysed, conclusions may also be more robust.
  • Stimulation of additional research: shared datasets can provide inspiration for new work, or may make possible a project that would not otherwise have been achievable.
  • Aiding research reproducibility: having access to the underlying data makes it easier for other researchers to see how conclusions were arrived at.

Benefits to funding bodies

  • More efficient use of funding: if data from earlier projects is available, funders do not have to pay for additional data collection when this is unnecessary.
  • Better return on investment: it is rare for the full value of a dataset to be exploited by a single project. If the same dataset can be used in multiple endeavours, more research can be achieved for the same outlay.

These benefits may also make research projects which plan to share their data more attractive to funding bodies, resulting in another potential benefit to researchers who opt to do this.

Researchers are sometimes concerned that their data will be misused, or that it will be used only to question the original analysis. However, in practice, this is rare. Researchers can reduce the risk of misinterpretation of their work by ensuring that data is well documented, and including clear methods information helps make it straightforward for other researchers to validate their conclusions. But shared data is also frequently used in ways not envisaged by the data creator: the focus may be on variables or aspects of the dataset deemed unimportant for the original project, for example, or the data may turn out to be valuable to researchers in another discipline, or may provide inspiration in terms of content or methodology.

What is open scholarship?

Recent years have seen a move towards greater openness in academic research. This takes a number of forms, including open access publications, open research methods, and open data. There are a number of key drivers:

Widening access to the results of research

Making research outputs more openly available allows anyone with an interest to make use of them, without the need for costly subscriptions. Where research is funded with public money, it is particularly appropriate that the fruits of that research should be publicly accessible.

Enabling collaboration

Widening access makes it easier for researchers to build on each other's work. This helps to make the process more efficient, and to increase the quantity of productive work that can be done.

Greater transparency and reproducibility

Better information about how conclusions were arrived at helps researchers to retrace the steps of those who have gone before them. This helps to increase understanding, validate results, and make the research process more robust.

What about open data?

Open data is an important part of open scholarship. Data is open if it is available for anyone to access, use, modify, and share. There may be some minimal requirements (e.g. for the data creator to be credited), but potential reusers are given a large degree of freedom.

Open data should be made available under an open licence - see the Data licensing section below for more details.

Not all data is suitable for making fully open: if the content is confidential or sensitive, or if the researchers are intending to seek a patent, then there may be good reasons to restrict access. However, even when this is the case, it is worth considering whether some version of the dataset could be shared (for example, an anonymised, redacted, or aggregated version), or whether controlled access to the data could be permitted.

For more on open data, see the Open Data Handbook from the Open Knowledge Foundation.

Planning for sharing

As a general rule of thumb, it is good practice to make as much data as possible available for reuse. At a minimum, you should aim to make all data which supports research findings or conclusions available, unless there are specific reasons for keeping the data private. But your project may well produce other data which is well worth sharing.

Try not to limit your thinking to the confines of the original research: while data may certainly be valuable to those working in a similar field to your own, it could also have applications which are harder to predict. Data can sometimes turn out to be useful to researchers in other disciplines, to members of the general public, or for educational or training purposes. Remember that much historical data was originally collected for reasons that had nothing to do with academic scholarship, but has subsequently proved to be a treasure trove for researchers. Sailors who kept weather logs a century or so ago probably never envisaged that their observations would feed into climate change research; similarly, you never know who might find your data priceless in the future.

Rather than asking which data should be shared, it may be helpful to turn the question on its head, and ask instead which data cannot be shared, with a view to sharing everything else. Projects which produce very large quantities of data may need to make practical decisions about how much it is feasible to share. Other reasons for keeping data private may include confidentiality, intellectual property issues, or plans to seek a patent. However, even if data cannot be shared openly, it is worth considering whether it could be made available in a restricted manner: see the section on this below for more.

Choices made early on in a project may influence how easy it is to share data later, and thus it's important to plan ahead, with preservation and reuse in mind right from the start. For example, it is essential that any research involving human participants secures appropriate consent, and it is much more straightforward to do this at the point when the data is collected, rather than trying to go back and do it retrospectively.

The sections below explore some specific aspects of preparing data for sharing.

If data contains sensitive or confidential information, this may be a barrier to sharing - especially if the wish is to share it without restrictions. For example, data which includes personal information about living identifiable individuals will need to be anonymised or pseudonymised, unless explicit consent has been obtained from the research subjects to share the non-anonymised version.

It should be noted, however, that deleting obvious identifiers (names, email addresses, and so on) may not be sufficient to fully anonymise a dataset: it may still be possible to deduce someone's identity by combining other pieces of information (a postcode and a rare medical condition, for example). Additionally, some types of data, such as video recordings, are very difficult to anonymise adequately.
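The risk described above can be screened for with a simple count. The following is a minimal sketch, not a complete anonymisation check: it groups records by a chosen set of indirect identifiers and reports the size of the smallest group. The field names and records are hypothetical, and a reassuring result does not by itself show that a dataset is safe to share.

```python
from collections import Counter


def smallest_group_size(records, quasi_identifiers):
    """Return the size of the smallest group of records that share the same
    combination of values for the given fields (0 for an empty dataset)."""
    if not records:
        return 0
    groups = Counter(
        tuple(record[field] for field in quasi_identifiers)
        for record in records
    )
    return min(groups.values())


# Hypothetical records with names and email addresses already removed.
records = [
    {"postcode": "OX1", "condition": "asthma"},
    {"postcode": "OX1", "condition": "asthma"},
    {"postcode": "OX2", "condition": "asthma"},
    {"postcode": "OX2", "condition": "rare condition"},
]

# A result of 1 means at least one record is unique on these fields,
# so that individual might be identifiable by combining them.
k = smallest_group_size(records, ["postcode", "condition"])
```

In this example, postcode alone leaves every record in a group of two, but postcode combined with condition singles out individual records.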

For advice on anonymisation, you can contact the University's Information Compliance team.

Data may sometimes need to be redacted for other reasons: for example, it may deal with the location of endangered species, or may contain third party material to which someone else controls the intellectual property rights. It is helpful to indicate what sort of information has been removed and why (insofar as doing so is compatible with the reasons for which the data has been redacted), so that data reusers will be able to make sense of any gaps in the dataset. Redacting data can sometimes make it less representative: explaining the steps taken can help guard against inadvertent misinterpretation.

There are many approaches or techniques that may be used to render data less sensitive in some way, and what is appropriate may vary considerably from case to case. A balance needs to be struck between protecting participants (and the original researcher) or removing other information that could be misused, and not unduly degrading the data. If it is not possible or practical to produce a dataset that can be shared openly without significantly reducing the value of the data, it may be necessary to consider other options - for example, deposit in an archive which offers access restrictions.

Data is only useful to potential reusers if they can make sense of it. Documentation provides contextual information that helps new users to orientate themselves within the dataset, and to interpret it properly.

Information may be given about the dataset as a whole (in a README file, for example), or about specific aspects of the data (for example, clear labelling of variables, or annotations of actual or apparent anomalies). Both of these have an important role to play.

Documentation should cover:

  • When, where, for what purpose, and by whom the dataset was created
  • A description of the dataset
  • Details of methods used
  • Details of what has been done to the data - for example, has it been cleansed, edited, restructured, or otherwise manipulated, and if so, how?
  • Explanations of any acronyms, coding, or jargon
  • Recommendations or requirements on usage and citation (a repository will deal with this if one is used)
  • Any other notes which will help with proper interpretation
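
One way to keep track of the checklist above is to hold the README content in a simple structure and check it for gaps before deposit. This is only an illustrative sketch: the section keys and the draft content are hypothetical, and a repository may have its own README template that should be used instead.

```python
# Hypothetical section keys, one per item in the documentation checklist.
README_SECTIONS = {
    "provenance": "When, where, why, and by whom the dataset was created",
    "description": "Description of the dataset",
    "methods": "Methods used",
    "processing": "What has been done to the data",
    "glossary": "Acronyms, coding, and jargon",
    "usage": "Usage and citation",
    "notes": "Other notes",
}


def missing_sections(readme):
    """Return the keys of checklist sections that are absent or empty."""
    return [key for key in README_SECTIONS if not readme.get(key, "").strip()]


def render_readme(readme):
    """Render the completed sections as plain text, in checklist order."""
    parts = []
    for key, heading in README_SECTIONS.items():
        text = readme.get(key, "").strip()
        if text:
            parts.append(f"{heading}\n{text}")
    return "\n\n".join(parts)


# Hypothetical draft with only some sections written so far.
draft = {
    "provenance": "Collected 2021-2022 by the (hypothetical) Example Project team.",
    "description": "Survey responses, one row per participant.",
    "methods": "Online questionnaire.",
}
gaps = missing_sections(draft)
```

Here `gaps` lists the sections still to be written, which can be a useful prompt before asking a colleague to review the documentation.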

It can be helpful to ask a colleague who has not previously worked closely with the dataset to review the documentation: it is often easier for someone with an outside perspective to spot gaps, or to identify things that need additional clarification.

In addition to providing documentation which can be used alongside the dataset to aid comprehension, it's also important to have good metadata, or data about the data. This often takes the form of a catalogue record which describes the dataset as a whole. Providing appropriate metadata helps to make a dataset more discoverable, thus increasing the chances of reuse. Rich metadata is a key aspect of the FAIR principles for data - see the section below for more details.

If data includes third party material (if, for example, it includes content from a number of pre-existing datasets), the rights holder(s) may have imposed restrictions on what can be done with it. It is important to abide by any terms of use when sharing data, and to ensure that any conditions that apply to subsequent reusers are made clear.

Some potential difficulties with using third party data can be alleviated by advance planning. For example, if you are using some third party data that cannot be shared, a combined dataset might be structured in a way that makes it easy to separate this from the rest of the material. This allows the rest of the content to be shared in an appropriate manner at the end of the project - along with a full citation for the non-shareable third party material.
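
As a rough illustration of the structuring approach described above, each record in a combined dataset can carry a field noting where it came from, so that non-shareable third party material can be set aside at the end of the project. The `source` field and the source names below are hypothetical.

```python
def split_by_shareability(rows, non_shareable_sources):
    """Split rows into (shareable, withheld) according to their 'source' field."""
    shareable, withheld = [], []
    for row in rows:
        if row["source"] in non_shareable_sources:
            withheld.append(row)
        else:
            shareable.append(row)
    return shareable, withheld


# Hypothetical combined dataset: project data plus licensed third party data.
rows = [
    {"id": 1, "source": "project", "value": 3.2},
    {"id": 2, "source": "licensed_provider", "value": 4.1},
    {"id": 3, "source": "project", "value": 2.7},
]

shareable, withheld = split_by_shareability(rows, {"licensed_provider"})

# The distinct withheld sources can then be cited in the documentation.
sources_to_cite = sorted({row["source"] for row in withheld})
```

Recording the source at the point of collection is much easier than trying to reconstruct it once the datasets have been merged.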

If you have questions about the reuse of third party material, you can contact the Bodleian's Copyright and Licensing Specialist.

FAIR data

The FAIR Data Principles were defined in a 2016 article in Scientific Data. They are designed to promote: 

  • Findability
  • Accessibility
  • Interoperability
  • Reusability

The FAIR principles emphasise machine-actionability: growth in volume and complexity mean that computational support is increasingly necessary when locating and dealing with data. Among other things, the principles promote the use of rich metadata, persistent identifiers, data licences, and shared vocabularies and community standards. The full FAIR principles are reproduced in the section below.

While a key goal of the principles is to promote reuse, FAIR data is not always open data. If discovery is facilitated by rich metadata, and clear details of the process for applying to access the data are provided, then even a sensitive dataset which cannot be made openly available can achieve a high degree of FAIRness.

The FAIR principles, as given on the GO FAIR website, are as follows:

Findable

The first step in (re)using data is to find them. Metadata and data should be easy to find for both humans and computers. Machine-readable metadata are essential for automatic discovery of datasets and services, so this is an essential component of the FAIRification process.

F1. (Meta)data are assigned a globally unique and persistent identifier

F2. Data are described with rich metadata (defined by R1 below)

F3. Metadata clearly and explicitly include the identifier of the data they describe

F4. (Meta)data are registered or indexed in a searchable resource

Accessible

Once the user finds the required data, she/he/they need to know how they can be accessed, possibly including authentication and authorisation.

A1. (Meta)data are retrievable by their identifier using a standardised communications protocol

A1.1 The protocol is open, free, and universally implementable

A1.2 The protocol allows for an authentication and authorisation procedure, where necessary

A2. Metadata are accessible, even when the data are no longer available

Interoperable

The data usually need to be integrated with other data. In addition, the data need to interoperate with applications or workflows for analysis, storage, and processing.

I1. (Meta)data use a formal, accessible, shared, and broadly applicable language for knowledge representation.

I2. (Meta)data use vocabularies that follow FAIR principles

I3. (Meta)data include qualified references to other (meta)data

Reusable

The ultimate goal of FAIR is to optimise the reuse of data. To achieve this, metadata and data should be well-described so that they can be replicated and/or combined in different settings.

R1. (Meta)data are richly described with a plurality of accurate and relevant attributes

R1.1. (Meta)data are released with a clear and accessible data usage license

R1.2. (Meta)data are associated with detailed provenance

R1.3. (Meta)data meet domain-relevant community standards

The principles refer to three types of entities: data (or any digital object), metadata (information about that digital object), and infrastructure. For instance, principle F4 defines that both metadata and data are registered or indexed in a searchable resource (the infrastructure component).
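
To make the emphasis on machine-actionable metadata more concrete, the sketch below checks a metadata record for fields that support some of the principles listed above. It is an illustration only: the field names, the mapping to principles, and the placeholder identifiers are assumptions, and passing such a check does not in itself make a dataset FAIR. In practice, a repository will normally supply its own metadata schema.

```python
# Hypothetical field names, each mapped to a FAIR principle it supports.
FIELD_TO_PRINCIPLE = {
    "identifier": "F1",        # persistent identifier for the record
    "description": "F2",       # rich descriptive metadata
    "data_identifier": "F3",   # identifier of the data being described
    "licence": "R1.1",         # clear data usage licence
    "provenance": "R1.2",      # how the data was produced
}


def unmet_principles(metadata):
    """Return the principles whose supporting field is absent or empty."""
    return [
        principle
        for field, principle in FIELD_TO_PRINCIPLE.items()
        if not metadata.get(field)
    ]


# Hypothetical record; the identifiers are placeholders, not real DOIs.
record = {
    "identifier": "doi:10.1234/example-record",
    "data_identifier": "doi:10.1234/example-data",
    "description": "Survey responses, one row per participant.",
    "licence": "CC BY 4.0",
}
unmet = unmet_principles(record)
```

In this example the record has no provenance information, so `unmet` contains only R1.2.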

How to share research data

When selecting a method of data sharing, there are a number of things to think about:

  • Ease of data discovery
  • Ease of data access
  • Sustainability of data sharing solution

The ideal data sharing solution makes it straightforward for interested parties to find out about the data and to acquire a copy, and will ensure that the data remains available long into the future.

Whichever method is adopted, it is good practice for research publications which make use of the data to include a data availability statement. This is a brief note which either indicates where and how the relevant data can be accessed, or if some or all of the data is not available, explains why this is.

The pros and cons of some common methods of data sharing are discussed below. Further information can be found in the Options for preserving your data section of the Post-project preservation page: a number of the options listed there can be used for sharing data.

Data archives and repositories

One of the best methods of making data available for reuse is to deposit a copy in a specialist archive or repository. Archives exist for the specific purpose of preserving and sharing data, and as such, they are well equipped to make data discoverable, accessible, and sustainable.

Data archives are covered in more detail on the Post-project data preservation page.

Re3data and FAIRsharing both maintain extensive catalogues of data archives. Oxford has its own institutional archive for research outputs, including datasets: ORA.

Many data archives are able to apply access controls to data deposits, and hence may be a good option for data which is not suitable for completely open sharing. This is covered in the Restricting access to data section below.

Other institutional services

Oxford's Sustainable Digital Scholarship service offers a Figshare-based platform designed for sharing research data. As the name suggests, it is designed to provide a long-term home for data, and ensure that it remains available well into the future.

Oxford's institutional Open Science Framework (OSF) offers a complete platform for managing your research, from planning through to publication. It is now open to members of all divisions and departments across the University, and you can access it using your Oxford SSO credentials.

General repositories and online sharing platforms

A number of services exist with the goal of making it easy to share research material (including but not limited to data) online. These may be publicly-funded services or commercial ones, and include Zenodo and the main Figshare service (as distinct from Oxford's Figshare-based SDS).

These platforms vary a good deal, so it's important to check the terms and conditions carefully. They can provide a quick and convenient way of making data and other materials available, but data may not be as easily discoverable, and sustainability is not always guaranteed. They are also generally less likely to offer active curation of data or access controls than specialist repositories or institutional services.

Project website

If your research project has a website, it may be appropriate to host a copy of the data there. This can be an effective way of sharing the data with a wider public, and may allow you to offer features that would not be available via a data archive, such as a custom search interface.

However, it is not advisable to rely on this as the sole method of making data available for the long term. Maintaining a website after a project concludes presents a number of challenges: funding bodies are generally reluctant to cover costs incurred after the end of the grant period, and project team members are likely to move on to other endeavours. It is therefore hard to predict how long a project website will remain viable. If at all possible, a project website should be seen as an additional method of sharing data alongside depositing a copy in an archive, rather than as an alternative to it.

Supplementary material to a journal article

In some fields, researchers may provide data files to be published alongside the journal article which presents the conclusions drawn from the data. This makes it very easy for readers to access the data, and has the advantage of presenting it in context. If the journal is a well-established one, it is also likely that the data will remain available for a considerable period of time (though it is worth checking the journal terms and conditions to see whether this is guaranteed).

However, there are once again reasons not to rely on this as the sole method of data sharing. While making the data available in this way is convenient for readers, it may be harder for other interested parties to discover the data. Additionally, the data relevant to a particular article will frequently only be a subset of the data produced by a research project. Where possible, it is therefore good practice also to deposit a fuller version of the dataset in a suitable archive.

Data available on request

It has been common in the past for researchers to add a note to research publications saying that data is available on reasonable request from the authors. However, this is not an ideal method of sharing data: it relies on potential reusers being able to contact the original researchers, which may be difficult if some time has passed and contact details have changed.

Even if the data is very sensitive, and a custom data sharing agreement would be required for any reuse, it is better if possible to have this process mediated by a specialist archive, rather than placing the responsibility on individual researchers. Some archives have processes in place for relaying requests back to data creators when necessary, and it may be possible to reach an agreement in advance about what action (if any) should be taken if the data creators cannot be located.

If no other sharing solution is viable, making the data available on request is preferable to not making it available at all, but it should nevertheless be viewed as a last resort, and only be adopted once other possibilities have been exhausted.

Data licensing

A licence clarifies the terms of use for your data – disentangling what can otherwise be quite a complicated default legal position.

A licence is a formal statement issued by the holder of the rights to a particular work (e.g. a database or other dataset), giving permission to use the work in certain specified ways. You might, for example, specify that if your data is reused, you must be cited as the creator, or that the data may be used for research and educational purposes, but not in commercial contexts.

Licences for data fall into two broad categories:

Traditional data sharing or collaboration agreements

These grant rights to specific individuals or entities. They typically take the form of a relatively formal contract, which will need to be agreed by both parties.

If you need a data sharing agreement, you should consult Research Services. They have a range of template agreements, and will be able to help tailor one to your specific needs.

Open licences

These grant rights to anyone, often subject to certain minimal conditions such as attribution of the data’s creator. They usually take the form of a short statement which accompanies the dataset, sometimes with a link to more comprehensive information elsewhere.

If a potential reuser wishes to use your data for a purpose not covered by the open licence, they can still approach you directly to discuss the possibility - but the licence removes the need to do this in straightforward cases.

Commonly used open licences for data include Creative Commons and Open Data Commons. Alternatively, if you control all rights to your dataset, and wish to make it available for others to use without any restrictions at all, you might consider using a Creative Commons CC0 (CC Zero) public domain dedication and waiver.

The DCC’s guide How to License Research Data provides a useful overview.

An ORCID is a unique persistent identifier for an individual researcher. Using ORCIDs in documentation allows research datasets to be unambiguously linked to their creators, even if they have subsequently moved to a new institution, or are now known by a different name.

The ORCID at Oxford service allows researchers to verify their Oxford affiliation in their ORCID record.

For more information, you can watch ORCIDs from Scratch (SSO login required), a recording of a presentation given at the Oxford Festival of Open Scholarship 2023.

Restricting access to data

Not all data is suitable for sharing. However, even if data cannot be shared openly, or shared immediately, this does not automatically mean it cannot be shared at all. The sections below explore some of the options.

There are a number of legitimate reasons for not making data publicly available. Some of these have to do with the nature of the data itself (e.g. the data is confidential or otherwise sensitive), whereas others result from the nature of the research process (e.g. researchers may still be working on their primary analysis, or may be intending to seek a patent).

If you intend to share confidential data, you may need a data sharing agreement to be in place. Consult Research Services for more information on this.

If you are planning a patent application, it is important to avoid prior disclosure. Before sharing, seek advice from Oxford University Innovation.

It is generally acknowledged that researchers are entitled to a period of privileged access during which they can work on the data before making it available to others. However, if this period extends after the formal end of a project - because researchers are waiting for publications to appear, for example - the point at which the data becomes shareable may occur when researchers have already moved on to other endeavours. This is at best an annoyance, and at worst may make sharing significantly less likely to happen.

A convenient solution to this problem is to deposit a copy of the data with an archive which allows data to be placed under a fixed term embargo. This typically means that a metadata record will be available for the data, but the data itself will not be downloadable until the embargo has expired.

Another advantage of this approach is that the data is citable even before it is publicly available, and thus can be referenced in research publications and data availability statements.
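
The behaviour of a fixed term embargo can be summarised in a few lines. This is a simplified sketch rather than a description of how any particular archive works: the function name is hypothetical, and it assumes that the data becomes downloadable on the embargo end date itself.

```python
from datetime import date


def access_state(embargo_end, today):
    """Describe what is visible for a deposit under a fixed term embargo."""
    return {
        # The metadata record is available (and citable) throughout.
        "metadata_visible": True,
        # Assumption: the data is released on the embargo end date itself.
        "data_downloadable": today >= embargo_end,
    }


# Hypothetical embargo ending on 1 January 2026.
embargo_end = date(2026, 1, 1)
during = access_state(embargo_end, date(2025, 6, 30))
after = access_state(embargo_end, date(2026, 1, 1))
```

During the embargo, only the metadata record is visible; once the end date is reached, the data is released without the researchers needing to take any further action.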

The length of embargo that is deemed appropriate varies between disciplines and between funding bodies. Funders frequently stipulate that data should be made available as soon as possible; some specify a particular time frame (which can sometimes be shorter than researchers would like it to be), though there may be room for negotiation if there are good reasons to delay.

Some data archives can accommodate a range of different access restrictions. Common access controls include:

  • Requiring users to register and agree to a standard set of terms and conditions before accessing the data
  • Requiring users to submit an application explaining their intended purposes before access is granted
  • Permitting users to view the data only in a secure location, after having been vetted and having received training in its appropriate use

However, provision in this area varies considerably: not all archives will be able to offer all the options listed above, and some may offer others. It is thus important to investigate what is available from archives for data in your discipline at an early stage, so that you can plan for ultimate sharing with this in mind.

For an example of a repository offering multiple access tiers, see the UK Data Service's page on Access control. The UK Data Service makes use of the Five Safes framework: a set of principles designed to enable data services to provide safe access to data.

Some archives (including Oxford's own ORA) may ask you to nominate a data steward. This is a person who can take responsibility for answering questions (and where appropriate, making decisions) about your data in the event that you cannot be contacted. If possible, it is better to nominate the holder of a particular post rather than a named individual. Some departments and units may have a designated data steward, or there may be a senior researcher (e.g. the head of a research group) who would be well placed to take on the role; otherwise, a departmental administrator or subject librarian may be a good option.