What do we mean by a data archive?
A place to securely hold digital research materials (data) of any sort, along with documentation that helps explain what they are and how to use them (metadata). The application of consistent archiving policies, preservation techniques and discovery tools further increases the long-term availability and usefulness of the data. This is the main difference between storing and archiving data. ORA-Data is the University of Oxford’s research data archive.
An archive is for stable (completed) versions of the data, not a research workspace. Once data are deposited, they remain in that state and are attributable via a persistent identifier. Active (‘live’) data that are constantly being worked on should not be deposited in an archive. Stable snapshots of longitudinal data can be deposited.
Why should I use a data archive?
A data archive is an option when you want to store your data in an actively curated environment for a significant period and to disseminate details of your data. An external archive can also be a solution when your funder or journal publisher encourages you to make your data accessible or link it to publications. Some journals, particularly in the bio-sciences, suggest named public archives.
Is there already an archive appropriate for your subject area?
Some disciplines are well served by established and well-known data archives. Examples include the UK Data Archive, Dryad, GenBank, EMBL-EBI, and the Natural Environment Research Council (NERC) Open Research Archive. Whether you can deposit in some of these archives depends on your funding body or publisher. Over a thousand specialist archives are listed in the Re3Data registry of research data repositories. Some Oxford departments also have well-established data stores that have served their research groups for a significant time. However, some disciplines do not have obvious locations for archiving data.
ORA-Data as an Oxford University data archive
As well as being a catalogue of data produced by Oxford researchers, ORA-Data is a data archiving service provided by the University of Oxford. ORA-Data accepts data from any discipline and especially data that underpins a publication. It also accepts datasets that must be deposited to comply with a funder’s policy, but where there is no suitable archive.
The aim, therefore, is not to hold all research data produced by Oxford researchers: ORA-Data will co-exist with established archives.
ORA-Data preserves stable versions of data and can assign a DOI to collections if necessary, making them citable. Each collection has a freely available online record so the data can be found. The metadata contained in these records complies with international standards. Researchers often have other means of storing data open to them, but these may lack some of the benefits of depositing with an archive.
Sustainable Digital Scholarship
The Sustainable Digital Scholarship service is designed to allow researchers to store, work with, preserve, and share research data. The SDS platform, which is provided by Figshare, can be used both for live data storage and as a way of keeping research data safe for the long term and making it available to a wider public. The service is based in the Humanities Division, but is available to researchers from across the University: charges apply for non-Humanities projects, and for new Humanities projects which have not yet applied for funding.
DigiSafe
In 2020, the University of Oxford launched a new digital preservation service, known as DigiSafe. This is designed to provide secure long-term storage for data which needs to be preserved, but which is not suitable for sharing. Thus it may provide a suitable home for some categories of research data (for example, identifiable patient records from medical research projects), but it does not offer a means of making data openly available for reuse. DigiSafe is offered on a subscription basis to departments, colleges, and other units, so access to the service is dependent on whether your unit has opted to subscribe.
Oxford University IT Services’ HFS Backup and Archiving Service
IT Services’ HFS will be known to many as the University’s central backup service: researchers can back up their computers either manually or on a scheduled basis and restore files or folders in the event of disaster, such as hard drive failure, or theft of equipment. But while the HFS backup service offers protection against loss of active data, it is not intended for long-term data archiving.
Until early 2021, HFS also offered a private archiving service. However, this was very different from ORA-Data: only the individual or research group that deposited the data could access it again in the future, minimal contextual information was stored, there was no means to search the archive, and data held therein could not be cited. The HFS Archive is no longer accepting new applications, and researchers currently using the service are encouraged to migrate their data elsewhere as soon as possible.
Personal storage on an external hard drive
An external hard drive is inexpensive but should only be seen as a temporary or short-term storage option. Storing your data on a drive differs from depositing it in a reputable archive in that:
- A drive does not benefit from multiple backups in multiple locations
- Preservation and curation actions will not be carried out to ensure continued accessibility of content over time
- Using a drive places all responsibility for looking after the data properly on the drive owner
- The data are not assigned a DOI, which makes them harder to cite
- The data are not discoverable by others
- All hard drives fail eventually
Commercial cloud storage
Commercial cloud storage should be seen as an online equivalent of hard drive storage rather than as an alternative to archival preservation. Pricing and convenience of use can be attractive, but the terms and conditions of each provider should be examined in detail. It is common for such services to clearly state they accept no liability for security breaches or data loss, and there is usually no guarantee that data will be preserved for the long term. Commercial cloud storage is generally unsuitable for data which is in any way sensitive or confidential (in other words, you should not use it for any data which you would not be happy to make available on the open web). In addition, most cloud storage services do not offer appropriate citation or access options.
Criteria for selecting a data archive
1. Long-term accessibility
A data archive should agree to store your data for a significant period of time. It should also undertake to ensure data will remain findable and accessible for this period. It should give details of scenarios and procedures where data will be removed or deleted. Check the policies and terms and conditions of the archive carefully to ensure that it will retain your data for as long as you require, or at least give you sufficient notice of removal.
If you sign up to a commercial service as a private individual, there may be little or no guarantee of long-term access. For example, Figshare’s terms and conditions warn that the company “may terminate your access to all or any part of the Service at any time, with or without cause, with or without notice, effective immediately, which may result in the forfeiture and destruction of all information associated with your account, including User Submissions” and that they reserve the right to “discontinue any part of the Service at any time without notice.” You will generally have more protection if there is a formal relationship between the University and the service provider.
2. Citation and discoverability
Two of the minimum requirements that funders and journal publishers set for datasets are that the dataset can be cited and found. To this end, the dataset should have a persistent, meaningful and discoverable record. The metadata describing the dataset should be compliant with common standards. This is likely to include the DataCite minimum metadata set for data citation. ORA-Data enables data creators to assign rich metadata to their dataset, which allows them to comply with funder and publisher requirements as well as to receive credit and acknowledgement for their data.
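As a rough illustration of what a minimum metadata set can mean in practice, the sketch below checks a draft dataset record for the six properties that the DataCite Metadata Schema marks as mandatory (Identifier, Creator, Title, Publisher, PublicationYear and ResourceType). The dictionary keys, the `missing_datacite_fields` helper and the sample record are hypothetical and are not part of ORA-Data or of any DataCite tool.

```python
# Hypothetical helper: lists DataCite mandatory properties missing from a draft record.
# The key names below are illustrative, not an official serialisation.
DATACITE_MANDATORY = (
    "identifier",
    "creators",
    "titles",
    "publisher",
    "publicationYear",
    "resourceType",
)


def missing_datacite_fields(record):
    """Return the mandatory property names that are absent or empty in record."""
    return [name for name in DATACITE_MANDATORY if not record.get(name)]


draft = {
    "identifier": "10.1234/example-dataset",  # made-up placeholder, not a real DOI
    "creators": ["Researcher, A."],
    "titles": ["Example survey data"],
    "publisher": "",  # not yet filled in
    "publicationYear": 2021,
    # "resourceType" has not been supplied at all
}

print(missing_datacite_fields(draft))  # ['publisher', 'resourceType']
```

A check like this only confirms that the fields are present. Whether the values are meaningful enough for others to find and cite the data still needs human judgement.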
3. Digital Object Identifiers (DOIs)
It is rapidly becoming the norm for datasets to have a DOI (the digital equivalent of an ISBN) so they can be cited and found. Some major funders recommend (although they don’t require) that a DOI is used as the unique identifier. Does your selected archive assign DOIs to deposited datasets?
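To make the idea of a citable identifier concrete, the sketch below turns a DOI into a link through the https://doi.org/ resolver. The `doi_to_url` helper and the sample DOI are hypothetical, and the pattern is a deliberately simplified assumption about what a DOI looks like (the prefix "10.", a numeric registrant code, a slash, then a suffix). It is not a full validator.

```python
import re

# Simplified assumption: "10.", a registrant code of 4 to 9 digits, "/", then a non-empty suffix.
DOI_PATTERN = re.compile(r"^10\.\d{4,9}/\S+$")


def doi_to_url(doi):
    """Return the https://doi.org/ link for a DOI-shaped string, else raise ValueError."""
    candidate = doi.strip()
    if not DOI_PATTERN.match(candidate):
        raise ValueError(f"not a DOI-shaped string: {doi!r}")
    return "https://doi.org/" + candidate


# Made-up placeholder DOI, used only to show the shape of the result.
print(doi_to_url("10.1234/example-dataset"))  # https://doi.org/10.1234/example-dataset
```

A DOI that contains unusual characters may need URL-encoding before it is used in a link, which this sketch does not attempt.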
4. Meeting funder’s requirements
Does the archive enable you to meet your funder’s requirements? The requirements might include elements such as:
- Taking active steps to preserve the data for sharing
- Ensuring access for a given period (e.g. 10 years after last access)
- Assignment of sufficient metadata to describe and locate the data
- Assignment of a DOI or other unique identifier
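The checklist above can be written down as a small structure for comparing candidate archives. This is only a sketch: the `ArchiveProfile` fields, the `unmet_requirements` helper, the ten-year default and the sample archive are all hypothetical, and a real funder policy may be worded quite differently.

```python
from dataclasses import dataclass


@dataclass
class ArchiveProfile:
    """What a candidate archive offers (hypothetical fields mirroring the checklist)."""
    name: str
    preserves_for_sharing: bool
    retention_years: int
    sufficient_metadata: bool
    assigns_identifier: bool


def unmet_requirements(archive, required_years=10):
    """Return a list of checklist items the archive does not satisfy."""
    unmet = []
    if not archive.preserves_for_sharing:
        unmet.append("active preservation for sharing")
    if archive.retention_years < required_years:
        unmet.append(f"access for at least {required_years} years")
    if not archive.sufficient_metadata:
        unmet.append("sufficient metadata")
    if not archive.assigns_identifier:
        unmet.append("DOI or other unique identifier")
    return unmet


# Invented example: keeps data for 5 years and issues no identifiers.
candidate = ArchiveProfile("Example Archive", True, 5, True, False)
print(unmet_requirements(candidate))
# ['access for at least 10 years', 'DOI or other unique identifier']
```

An empty list means the archive meets every item on this simplified checklist. The archive's own policies and terms should still be read in full.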
The pros and cons of archiving, along with alternative solutions, are illustrated in the following table:
| Option | Long-term preservation | Secure and multiple location backups | Documentation to support data | Data discoverability | Citation (inc DOI) | Online access to data |
|---|---|---|---|---|---|---|
| Sustainable Digital Scholarship | ✔ | ✔ | ✔ | ✔ | ✔ | |
| DigiSafe | | | | (Intended for data that needs to be kept private) | ✘ | Only for data depositor and others given specific permissions |
| IT Services HFS | ✘ (Access lost if you relocate to another institution) | ✔ | Unlikely and not structured to common standards | ✘ | ✘ | ✘ |
| Department or other local group storage | Will require investigation and confirmation | ✔ | Unlikely and not structured to common standards | ✘ | ✔ (If linked to record in ORA-Data) | Will require investigation and confirmation by you |
| Subject or National Archive e.g. NERC/UKDA | ✔ | ✔ | ✔ | ✔ | ✔ | |
| Journal archive/Dryad | Not guaranteed. Reliant on 3rd party terms and conditions | ✔ | ✔ | ✔ | ✔ | |
| Personal hard drive/Memory stick/Flash drive | ✘ | ✘ | ✘ | ✘ | ✘ | ✘ |
| Cloud Based Services | Not guaranteed. Reliant on 3rd party terms and conditions | Not guaranteed. Reliant on 3rd party terms and conditions | Some but generally not comprehensive | Limited | Some provide a DOI but not all | ✔ |
The Digital Curation Centre’s guide ‘Where to keep research data’ also offers some helpful guidelines for evaluating data repositories.