Post-project data preservation
The overriding reason for preserving data is that it is an important resource in its own right - and one which should not be abandoned once a project concludes. Researchers invest significant time and effort in collecting, collating, cleansing, and structuring data, and it is appropriate for this to be recognised.
Creation of a representative and well documented dataset is part of good research practice, providing a foundation for analysis and continued use. Preserving data allows the conclusions reached in the course of a research project (featured in journal articles, books, theses, conference presentations, and other outputs) to be supported or validated, and helps to make research reproducible.
It is very rare for the full value of a dataset to be mined in the course of a single project. Active curation transforms data stored for short-term use into preserved data with a future: it ensures researchers will continue to be able to access and make use of it long after a project has finished. Preserving data allows further potential to be tapped in the future, by the original creators or others.
Where it is appropriate, widening access to data gives researchers the opportunity to increase the visibility and impact of their work. If datasets can be cited, this helps the creators of the datasets to get proper credit for their work.
Making data available for reuse by others is covered in more detail in the Sharing data section.
There are regulatory requirements to preserve some kinds of data (for example, information about patients from some medical studies) for a minimum period after the research concludes. Funding bodies, universities, and other institutions also recognise the value of data preservation, and consequently may have policies which cover this area.
University of Oxford requirements
The University of Oxford's Policy on the Management of Data Supporting Research Outputs stipulates that researchers should preserve and provide appropriate access to any research data which underpins a research output for as long as it has continuing value - but for a minimum of three years after publication or public release of the research.
Researchers are also strongly encouraged to consider depositing their data (along with sufficient descriptive metadata) in an appropriate repository or archive, and to ensure that there is at least a data record in ORA, the University's own archive for research outputs, including datasets.
The University provides support to help researchers comply with this policy. For more information, see the Policy section.
Many funding bodies now require that data is preserved for a specified period (often between three and ten years) after the end of the project, and made available for reuse where this is appropriate. Certain funders may also require you to use a specific repository for storing your data. You can find out about the policies of different funders in the Funder requirements section.
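Where more than one retention period applies, the data needs to be kept until the later of the resulting dates. The following is a minimal sketch of that arithmetic, assuming the three-year University minimum runs from publication and a funder period runs from the end of the project. The function names and the example dates are illustrative assumptions only, and the actual policy wording should always be checked.

```python
from datetime import date


def add_years(d: date, years: int) -> date:
    """Return the same calendar day `years` later (29 Feb falls back to 28 Feb)."""
    try:
        return d.replace(year=d.year + years)
    except ValueError:
        return d.replace(year=d.year + years, day=28)


def earliest_review_date(publication: date, project_end: date,
                         funder_years: int, university_years: int = 3) -> date:
    """Return the later of the two minimum retention dates."""
    university_minimum = add_years(publication, university_years)
    funder_minimum = add_years(project_end, funder_years)
    return max(university_minimum, funder_minimum)


# Hypothetical project: published May 2024, ended September 2023,
# with a funder asking for ten years of retention.
example = earliest_review_date(
    publication=date(2024, 5, 1),
    project_end=date(2023, 9, 30),
    funder_years=10,
)
print(example)  # 2033-09-30
```

This gives only the earliest point at which retention could be reviewed: data with continuing value should be kept for longer.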
Planning for preservation
The creator of a dataset is usually best placed to decide what needs to be preserved. This will be based on a combination of:
- Knowledge of and insight into the data
- Consent or licensing agreements applying to the project
- Funders' and research institutions' requirements for data management and preservation
If a project is jointly funded, there may be multiple sets of expectations which need to be met.
The absolute minimum will be preservation of the data that underpins the results or conclusions presented in the project's other research outputs. However, many projects will produce additional data which is also well worth preserving.
Selection for long-term preservation should be based on:
- What is needed to validate research outputs
- Ethical, legal, or other regulatory reasons to retain or destroy data
- How difficult or costly it would be to reproduce the data
- Value for future reuse
When considering the potential for reuse, it's worth thinking outside the confines of the original research. Some data may be of interest to researchers in other disciplines, or to members of the general public, or it may be of use for educational or training purposes.
Thought also needs to be given at an early stage to the costs of preserving data, so that these can be included in the funding application. For example, it may be necessary to budget for additional time and effort to prepare data for preservation, and some data archives levy a charge for deposits.
Most funding bodies will cover reasonable costs, as long as these are incurred during the lifetime of the grant. You can check with your funder what support is available. Further information about funders' data management policies is in the Funder requirements section.
The process of curating data for preservation involves a number of key concerns: making sure the data remains usable for as long as possible, meeting regulatory requirements, and facilitating appropriate reuse. The last of these is dealt with more fully in the Sharing data section.
Data can only continue to be useful if it is possible first to access it, and then to interpret it properly. This requires appropriate technical choices of file formats, and good documentation providing a clear description of data.
Open formats are those which can be read by a wide range of software packages, rather than requiring a specific proprietary application to make use of them. Using them helps to safeguard the longevity of data, as they allow it to be accessed even if the program used to create it becomes obsolete in the future. Open formats also help to facilitate data reuse, as they reduce the need for future users to have access to expensive software.
Open formats include plain text files, CSV files for tabular data, TIFF or PNG for images, MP3 for audio files, and HTML or OpenDocument for text.
Proprietary software packages will often permit data to be exported in an open format, although this may involve the loss of some formatting or other features, which may reduce the usefulness of the data. In this sort of case, it is worth considering preserving copies of the data in both the proprietary file format and an open format. This allows those with access to the proprietary software to make use of its additional capabilities, but means at least a basic copy of the data is available to all.
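As an illustration of keeping an open-format copy alongside the original, the sketch below writes tabular records to a UTF-8 CSV file using Python's standard library and reads them back. The records, field names, and file name are hypothetical stand-ins for data exported from a proprietary package.

```python
import csv
import tempfile
from pathlib import Path


def export_csv(rows, fieldnames, path):
    """Write rows (a list of dicts) to a UTF-8 CSV file with a header line."""
    with open(path, "w", encoding="utf-8", newline="") as handle:
        writer = csv.DictWriter(handle, fieldnames=fieldnames)
        writer.writeheader()
        writer.writerows(rows)


def read_csv(path):
    """Read a CSV file back into a list of dicts (all values are text)."""
    with open(path, encoding="utf-8", newline="") as handle:
        return list(csv.DictReader(handle))


# Hypothetical records standing in for data held in a proprietary format.
records = [
    {"site": "A", "temperature_c": 12.5},
    {"site": "B", "temperature_c": 14.1},
]

with tempfile.TemporaryDirectory() as folder:
    target = Path(folder) / "observations.csv"
    export_csv(records, ["site", "temperature_c"], target)
    round_trip = read_csv(target)

# The numbers come back as text: CSV does not record data types.
print(round_trip)
```

The round trip shows the kind of feature loss to expect: values survive, but type information and any formatting do not, which is one reason to document units and types separately.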
Lossy versus lossless formats
Lossy formats are those where the process of data compression involves permanent destruction of some information. This results in a smaller file size, but also means that the original version of the file cannot be perfectly reconstructed.
If possible, lossless formats should be used for preservation. Where compromises need to be made (if, for example, storing a collection of very large files would be prohibitively expensive), then careful consideration needs to be given to the impact on future reusability. Lossy compression will make a more significant difference to some types of data than others: for example, audio recordings of spoken word can often be compressed without an audible loss of quality, whereas the same degree of compression of music would result in noticeable degradation.
Lossless formats include RAW and PNG for images, or FLAC for audio. Some formats, including TIFF for images and WAV for audio, can use either lossy or lossless compression methods.
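The difference can be seen in a few lines of code. This is a sketch only: zlib stands in for lossless compression, and dropping the low bits of each byte is a deliberately crude stand-in for a lossy codec, not how any real image or audio format works.

```python
import zlib

# Stand-in for raw sample data (for example, pixel or audio values).
original = bytes(range(256)) * 4

# Lossless: compress, then decompress, and every byte comes back.
restored = zlib.decompress(zlib.compress(original))

# Lossy (illustrative only): discard the low 4 bits of each byte.
# The discarded detail cannot be recovered afterwards.
quantised = bytes(b & 0xF0 for b in original)

print(restored == original)   # True
print(quantised == original)  # False
```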
To ensure data remains comprehensible and to reduce the risk of misunderstanding, it is important that everything is properly labelled and documented. A preserved dataset may be accessed many years after its initial creation, when memories of how it was developed or put together have faded: documentation can then be invaluable to serve as a users' guide and to provide context.
Documentation should aim to cover:
- Information about when, where, and by whom the dataset was created, and for what purpose
- A description of the dataset
- Details of methods used
- Details of what has been done to the data - for example, has it been cleansed, edited, restructured, or otherwise manipulated, and if so, how?
- Explanations of any acronyms, coding, or jargon
- Units of measurement
- Annotation of any anomalies (or apparent anomalies) where the reason for these is known
- Any other notes which will help aid proper interpretation
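One way to make sure none of these points is overlooked is to generate a README skeleton and fill it in as the project progresses. The sketch below is one possible approach: the section headings paraphrase the list above, and the function name and example dataset are hypothetical.

```python
REQUIRED_SECTIONS = [
    "Creation (when, where, by whom, purpose)",
    "Description of the dataset",
    "Methods",
    "Processing applied to the data",
    "Acronyms, codes, and jargon",
    "Units of measurement",
    "Known anomalies",
    "Other notes",
]


def build_readme(title, sections):
    """Return README text plus the list of sections still left blank."""
    lines = [title, "=" * len(title), ""]
    missing = []
    for heading in REQUIRED_SECTIONS:
        body = sections.get(heading, "").strip()
        if not body:
            missing.append(heading)
            body = "TODO"
        lines += [heading, "-" * len(heading), body, ""]
    return "\n".join(lines), missing


readme_text, missing = build_readme(
    "Example survey dataset",  # hypothetical title
    {"Units of measurement": "Temperatures in degrees Celsius."},
)
print(len(missing))  # 7
```

Listing the blank sections makes it easy to see what still needs to be written before the data is deposited.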
It is helpful to use informative file names, and for data to be structured in a way that makes it as easy as possible to navigate.
Datasets may sometimes need to be tidied or otherwise edited as part of the process of creating a preservation dataset. However, it is usually quicker and easier to document data as one goes along, rather than attempting to fill in all the gaps at the end of the project. Documentation written during a project to describe methodology, project progress, and other aspects of research activity can often be put to new use.
As a general principle, it is good to preserve as much data as possible. However, there are situations in which not all data can be kept. This may be for practical reasons (for example, because the quantity of data is such that it is not feasible to store it all), or because the data contains sensitive or confidential information which needs to be deleted after a certain point. Key considerations include:
- Honouring any commitments made to research participants (e.g. on consent forms)
- Compliance with data protection legislation
- Institutional expectations around ethical practice
For example, GDPR specifies that personal data should not be retained for longer than necessary. Researchers may thus sometimes opt to remove personal identifiers from a dataset at the end of the project, so that an anonymised version of the data may be preserved. It should be noted, however, that deleting obvious identifiers (names, email addresses, and so on) may not be sufficient to fully anonymise a dataset: it may still be possible to deduce someone's identity by combining other pieces of information (a postcode and a rare medical condition, for example). Additionally, some types of data, such as video recordings, are very difficult to anonymise adequately. Data creators will need to consider what can realistically be achieved without significantly reducing the value of the dataset, and then plan a suitable preservation strategy in the light of this.
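A simple way to spot this kind of residual risk is to count how many records share each combination of indirectly identifying values, in the spirit of a k-anonymity check. The sketch below assumes records held as dictionaries; the column names and values are invented, and passing a check like this does not on its own show that a dataset is anonymous.

```python
from collections import Counter


def risky_records(records, quasi_identifiers, k=2):
    """Return records whose quasi-identifier combination is shared by fewer than k rows."""
    def key(record):
        return tuple(record[name] for name in quasi_identifiers)

    counts = Counter(key(r) for r in records)
    return [r for r in records if counts[key(r)] < k]


# Hypothetical rows with names and email addresses already removed.
rows = [
    {"postcode": "OX1", "condition": "asthma"},
    {"postcode": "OX1", "condition": "asthma"},
    {"postcode": "OX2", "condition": "rare condition X"},
]

flagged = risky_records(rows, ["postcode", "condition"])
print(flagged)  # [{'postcode': 'OX2', 'condition': 'rare condition X'}]
```

Records flagged in this way might be generalised (for instance, by shortening the postcode) or withheld from the shared version of the dataset.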
The questions of which data should be preserved and of which data should be shared with a view to reuse need to be considered separately. Data which needs to be preserved but which is not suitable for sharing can be stored in a secure archive. In some cases, it may be appropriate to have multiple versions of a dataset: for example, an anonymised one which can be shared openly, and one retaining more personal information to which access is restricted. Making data available for reuse by others is covered in more detail in the Sharing data section.
One of the best ways of preserving research data for the long term is to deposit a copy in a specialist data archive.
An archive is a place to securely hold digital research materials (data), along with documentation that helps explain what they are and how to use them (metadata). The application of consistent archiving policies, preservation techniques, and discovery tools further increases the long term availability and usefulness of the data.
An archive is for stable (completed) versions of the data: it is not a research workspace, or a place for storing data that is still actively being worked on. This means that data is most often deposited towards the end of a research project. In the case of longitudinal datasets, or other long-term projects which continue gathering material over many years, it may be appropriate to deposit periodic snapshots of the data. Once datasets are deposited, they typically remain in the same state, and are citable via a persistent identifier.
Data archives are also known as repositories. We will use the two terms interchangeably.
Archives are designed to store data in an actively curated environment for a significant period, and to disseminate details of that data. They therefore offer significant benefits over attempting to host a preservation dataset on personal or departmental drives: in particular, they relieve the individual researcher of the responsibility of making sure the data remains available, and instead allow this to be handled by a body specialising in the curation of data.
For data which is suitable for reuse, they are also one of the best ways of ensuring that data is made available to as wide an audience as possible. Funders often encourage or mandate the use of archives, as do journal publishers, as they allow data to be linked to from publications.
Long term accessibility
A data archive should agree to store your data for a significant period of time. It should also undertake to ensure data will remain findable and accessible for this period. It should give details of scenarios and procedures where data will be removed or deleted. Check the policies and terms and conditions of the archive carefully to ensure that it will retain your data for as long as you require, or at least give you sufficient notice of removal.
Data archives supported by funders or by institutions such as the University of Oxford usually undertake to preserve data for substantial periods. This might not be the case with free third party services with which the University has no contractual agreement. Researchers are responsible for checking the terms and conditions to ensure that their chosen archive meets the necessary requirements.
Two of the minimum requirements set by funders and journal publishers are that a dataset can be found and can be cited. To this end, the dataset should have a persistent, meaningful and discoverable record, and the metadata describing it should comply with common standards. This is likely to include the DataCite minimum metadata set for data citation.
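Before depositing, it can be useful to check a draft record against that minimum set. In the sketch below, the six property names are those marked as mandatory in version 4 of the DataCite Metadata Schema. The dictionary layout, the function name, and the example values are assumptions for illustration, and a chosen archive may well ask for more.

```python
# Mandatory properties in the DataCite Metadata Schema (version 4.x).
DATACITE_MANDATORY = [
    "Identifier",
    "Creator",
    "Title",
    "Publisher",
    "PublicationYear",
    "ResourceType",
]


def missing_mandatory(record):
    """List mandatory properties that are absent or empty in a draft record."""
    return [name for name in DATACITE_MANDATORY if not record.get(name)]


# Hypothetical draft record, not yet complete.
draft = {
    "Identifier": "10.1234/example-doi",
    "Creator": ["Smith, J."],
    "Title": "Example survey dataset",
    "PublicationYear": 2024,
}

print(missing_mandatory(draft))  # ['Publisher', 'ResourceType']
```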
Digital Object Identifiers (DOIs)
It is rapidly becoming the norm for datasets deposited in an archive to be given a DOI (a permanent identifier: the digital equivalent of an ISBN) so they can be cited and found. Some major funders recommend (although do not require) that a DOI is used for the unique identifier.
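A DOI can be turned into a resolvable link by prefixing it with https://doi.org/, which makes it straightforward to build a citation for a dataset. The sketch below assumes the common creator, year, title, publisher, identifier pattern. The names, the repository, and the DOI are invented, and individual publishers may prescribe a different citation style.

```python
def doi_url(doi):
    """Turn a bare DOI into its resolvable https://doi.org/ link."""
    return "https://doi.org/" + doi.strip()


def cite_dataset(creators, year, title, publisher, doi):
    """Format a dataset citation: creators (year). title. publisher. DOI link."""
    authors = "; ".join(creators)
    return f"{authors} ({year}). {title}. {publisher}. {doi_url(doi)}"


# All values below are hypothetical.
citation = cite_dataset(
    ["Smith, J.", "Jones, A."],
    2024,
    "Example survey dataset",
    "Example Repository",
    "10.1234/example-doi",
)
print(citation)
# Smith, J.; Jones, A. (2024). Example survey dataset. Example Repository. https://doi.org/10.1234/example-doi
```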
Ability to restrict access to data
Sometimes a dataset is not suitable for sharing openly, but could appropriately be shared with specific groups (e.g. other researchers working in the field) or under specific conditions. Many archives are able to offer some level of restricted or mediated access to data, though provision in this area varies considerably.
Most archives allow depositors to place an embargo on their data, either for a specified period (until publications making use of the data have appeared, for example) or indefinitely. Some archives can also accommodate other types of restrictions: those interested in accessing the data may need to agree to a set of terms and conditions, apply for access stating their credentials and the use they intend to make of the data, or (for very sensitive data) perhaps even access the data via a designated secure terminal such as a SafePod.
If data is being deposited at a funder's request, it is important to ensure that the archive selected meets that funder's requirements. This is likely to include elements of the points listed above, for example:
- Taking active steps to preserve the data for sharing
- Ensuring access for a given period (e.g. ten years after last access)
- Assignment of sufficient metadata to describe and locate the data
- Assignment of a DOI or other unique identifier
While archives are generally the preferred option, in some cases, researchers may find that no suitable archive is available, or the data is subject to particular regulations concerning preservation and sharing which restrict where it can be deposited. The sections below therefore explore both archives and some alternatives.
If you would like to talk about selecting the most suitable option for your own data, please contact the Research Data Oxford team by emailing email@example.com.
ORA: the University of Oxford's repository for research outputs, including data
The Oxford Research Archive (ORA) is an archiving service provided by the University of Oxford. It also functions as a catalogue of data produced by Oxford researchers and deposited either in ORA or elsewhere.
ORA accepts data from any discipline, and especially data that underpins publications. It can provide a home for datasets that must be deposited to comply with a funder’s policy, but where there is no suitable national or discipline-specific archive. However, it is currently unable to accept deposits of sensitive or non-anonymised personal data.
ORA preserves stable versions of data and can assign DOIs to data collections if desired, making them citable. Each collection has a freely available online record, to aid data discovery. Data creators can assign rich metadata to their dataset, allowing them to meet funder and publisher requirements, and to receive proper credit and acknowledgement for their work.
ORA does not aim to hold all research data produced by Oxford researchers: it will co-exist with disciplinary and general archives. However, researchers depositing data elsewhere are strongly encouraged to create at least a metadata record in ORA.
DigiSafe is an opt-in subscription service designed to provide secure storage for data which needs to be preserved for short or long periods, typically a year or longer. It has strong features for adding metadata and preserving access to file formats even when the original software used to create the data is no longer available. Data access is comprehensively logged and there is regular integrity checking of all data on the platform. Jupyter notebooks can be run to analyse data directly on the platform.
It is most useful for categories of research data which are not suitable for sharing (for example, identifiable participant records from medical research projects). Stored data can be easily searched and retrieved by users with the appropriate permissions. Built-in functions allow easy management of retention schedules - where material has to be deleted after a set amount of time, for example. DigiSafe is offered on a subscription basis to departments, colleges, and other units, so access to the service is dependent on whether your unit has opted to subscribe. Individual research groups who have secured funding are also welcome to sign up for the service in their own right. Whilst data can be shared directly from the platform, the sharing functionality is quite basic.
Sustainable Digital Scholarship service
The Sustainable Digital Scholarship service is designed to allow researchers to store, work with, preserve, and share research data. The SDS platform, which is provided by Figshare, can be used both for collecting and editing data, and as a way of keeping research data safe for the long term and making it available to a wider public.
The service launched in the Humanities Division, but is available to researchers from across the University. Support and hosting are available free of charge to most pre-existing research projects seeking a more sustainable long-term home. For new projects which have not yet applied for funding, charges apply: quotations can be provided for support and hosting. These fees are only applicable during the funded phase of the project: once the active research period concludes, the data will be maintained on the system indefinitely without further charge.
Departmental data stores
Some Oxford departments have well established data stores that have served their research groups for a significant time. Because these are locally maintained, provision varies greatly: consult your local IT support staff to find out if your department is able to offer long term data storage. A departmental data store may be a good option in some circumstances (for example, if data needs to be preserved but is unsuitable for sharing, if you have very specialist requirements, or if no other archive for the type of data exists). However, such stores are rarely able to offer the same level of data discoverability as institutional or national archives, though in some cases it may be possible to create a metadata record in ORA and use this as a signpost to the departmental data store. It is also important to check carefully what guarantees can be given about the length of time for which the data will be preserved.
Nexus365 OneDrive provides secure cloud-hosted storage for members of the University of Oxford. However, whilst OneDrive is extremely convenient and has a relatively large quota, data stored there only remains available for as long as your Oxford account is active. It thus cannot be used to store data if you leave Oxford after the end of your project, and is not a suitable option for long-term data preservation.
Microsoft Teams and SharePoint Online
These services are also available as part of the Nexus365 service. As they can have multiple owners, they can in theory be used for storing data after one individual has left. If you are looking for a way of preserving data after you leave the University, this route will require the agreement of a permanent member of staff who could take over ownership responsibility, for example, your supervisor or head of department. However, these systems are not primarily designed to serve as long-term archives, and hence this should be considered only when no other viable options are available.
Over two thousand archives are listed in the Re3data registry of research data repositories, and FAIRsharing also maintains an extensive catalogue. Even so, some disciplines still lack an obvious location for archiving data.
Some disciplines are well served by established and widely known data archives. Examples include the UK Data Archive, Dryad, GenBank, EMBL-EBI, and the NERC Environmental Data Service. Some archives are linked to a particular funding body: deposit may be restricted to data from projects funded by that body, or it may be a condition of funding that data is offered to the archive.
Disciplinary archives offer a number of benefits. They are more likely to be known to other researchers working in the field, which helps make data discoverable, and they are also more likely to have specialist expertise in curating and working with the types of data typically generated.
Other archives and data stores
In addition to specialist disciplinary or institutional archives, there are also a number of general purpose archives which can be used to store data. These include both archives run by commercial organisations (e.g. Figshare) and publicly funded ones (e.g. Zenodo, which is associated with CERN).
These may be a good option in some cases (particularly if no subject-specific archive is available), but it is important to check the terms and conditions carefully to ensure that the archive meets all necessary requirements. Archives in this category often cast the net very wide, and will take a very broad range of material. However, they may offer less support to researchers seeking to deposit, and may not be able to provide much in the way of active curation of data after deposit.
If one of the outputs of a research project is a website, it can sometimes be appropriate to host a copy of the data there. However, while this may be an effective way of sharing the data with a wider public, it is not advisable to rely on this as the sole method of preserving data for the long term. Maintaining a website after a project concludes presents a number of challenges (funding bodies are generally reluctant to cover costs incurred after the end of the grant period, and project team members are likely to have moved on to new endeavours), and hence it is hard to predict how long a project website will remain viable for. If at all possible, an additional copy of the data should therefore be deposited in an archive.
Third party cloud storage (i.e. services other than Nexus365 OneDrive, Teams, or SharePoint Online)
Compared to archival preservation, using commodity cloud storage means you will have to take far more responsibility for preservation and curation to ensure content remains accessible over time. Your data will also be significantly less discoverable, and is unlikely to be assigned a DOI.
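One routine preservation task that then falls to you is checking that files have not been silently altered or corrupted. The following is a minimal sketch of such a check using SHA-256 checksums. The function names and the manifest layout are illustrative assumptions, not a feature of any particular storage service.

```python
import hashlib
import tempfile
from pathlib import Path


def sha256_of(path, chunk_size=1 << 20):
    """Compute the SHA-256 checksum of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def build_manifest(folder):
    """Map each file's relative path to its checksum."""
    folder = Path(folder)
    return {
        p.relative_to(folder).as_posix(): sha256_of(p)
        for p in sorted(folder.rglob("*"))
        if p.is_file()
    }


def changed_files(folder, manifest):
    """List files that were added, removed, or altered since the manifest was made."""
    current = build_manifest(folder)
    names = manifest.keys() | current.keys()
    return sorted(n for n in names if manifest.get(n) != current.get(n))


# Demonstration in a temporary folder with a hypothetical data file.
with tempfile.TemporaryDirectory() as folder:
    data_file = Path(folder) / "data.csv"
    data_file.write_text("site,value\nA,1\n", encoding="utf-8")
    manifest = build_manifest(folder)
    unchanged = changed_files(folder, manifest)
    data_file.write_text("site,value\nA,2\n", encoding="utf-8")
    altered = changed_files(folder, manifest)

print(unchanged)  # []
print(altered)    # ['data.csv']
```

In practice the manifest would be saved alongside the data (and ideally in a second location), and the check repeated at intervals and after every transfer.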
Pricing and convenience of use can be attractive, but the terms and conditions of each provider should be examined in detail. It is common for such services to clearly state they accept no liability for security breaches or data loss. In addition, most cloud storage services do not offer appropriate citation or access options.
In some cases, cloud storage can provide an alternative way of making data publicly available: for example, it may be used to provide the data storage behind a search interface on a project website. However, as noted above, sustaining such websites for the long term frequently proves difficult, and it is therefore strongly recommended that researchers also consider depositing a copy of their data in an archive.
If you are intending to store data that should not be made fully publicly available, you will need to ensure that any third party cloud service has adequate data security. InfoSec can help you with the Third Party Security Assessment process.
Table of options for data preservation and sharing
|Option||Long-term preservation||Secure and multiple location backups||Metadata to describe data||Data discoverability||Citation (including DOI)||Suitable for personal or confidential data||Active curation of data||Public online access to data|
As long as there is a means of paying the annual subscription
Basic sharing is possible
If desired, data can be shared, but private by default
|Sustainable Digital Scholarship service||✔||✔||✔||✔||
Though chiefly intended for data which will be made public
|Department or other local group storage||Will require investigation and confirmation||✔||Unlikely and not structured to common standards||✘||
If linked to record in ORA
|Will require investigation and confirmation||✘||Will require investigation and confirmation|
|Subject or national archive - e.g. NERC or UKDA||✔||✔||✔||✔||
Most offer a DOI service
Check terms and conditions
|Other archives - e.g. Zenodo||Check terms and conditions
(Zenodo offers at least 20 years)
Check terms and conditions
Check terms and conditions
Probably not - check terms and conditions
|Commercial data sharing platform - e.g. Figshare||Not guaranteed: reliant on third party terms and conditions||✔||✔||Variable||
Check terms and conditions
Variable: InfoSec TPSA required
|Journal archive||Not guaranteed: reliant on third party terms and conditions||✔||✔||✔||
Check terms and conditions (DOI services are common)
Check terms and conditions
|Check terms and conditions||✔|
Difficult to guarantee
|Variable: check terms and conditions||Unlikely and not structured to common standards||Dependent on discoverability of site||URL, but probably no DOI||
Generally designed for public data
|Nexus365 OneDrive for Business||
Access lost if you leave Oxford
Access only for owner and others given specific permissions
Requires Oxford user to take ownership responsibility
Access only for those given specific permissions
|Commercial cloud-based storage||Not guaranteed: reliant on third party terms and conditions||Not guaranteed: reliant on third party terms and conditions||Some but generally not comprehensive||Variable||Some provide a DOI, but not all||
Variable: InfoSec TPSA required
|Personal hard drive, USB flash stick or memory card||✘||✘||✘||✘||✘||✘||✘||✘|