Research data management glossary

This list explains some of the terms used to describe storage and management of digital research data. Some terms are interchangeable in common usage, and some have specific meanings in particular contexts.

Backup

An additional copy of digital data which is stored for use as a replacement in case the main copy is either deleted or corrupted. A backup service does not provide the same service as an archive: it is not generally intended for long-term preservation after the end of a project, and it does not provide access for data consumers or individuals other than the data owner and possibly IT support.

Content management system (CMS)

A system which allows multiple users to easily create, edit, and publish content, usually on a website. Drupal and WordPress are examples of popular content management systems. The University of Oxford's Mosaic website-building platform uses a Drupal-based CMS.

There is increasingly an overlap between the capabilities of content management systems and document management systems.

Curation

Curation is the act of managing digital items held within an archive over the long term. It is an active process, implying positive steps taken by the curators to help items remain secure, discoverable, and accessible. Digital curation involves maintaining and preserving archived items throughout their life cycle: the active management of digital research data reduces threats to its long-term research value, and mitigates the risk of digital obsolescence. The process includes selection, appraisal, preservation, disposal, and transformation - for example, migration to an updated format.

Data dictionary

Documentation describing the contents, format, and structure of a dataset and the relationship between its elements. A data dictionary provides metadata about data elements: for example, it might include a table listing data attributes, with columns giving the attribute name, whether it is optional or required, the attribute type or format, and so on. An additional column for explanatory notes about each attribute is also helpful, especially if the data is to be shared with others. This could include a brief explanation of how the attribute is obtained or calculated.
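
As an illustration only, a data dictionary of this kind can be held as a small table and saved alongside the data. The sketch below uses Python; the dataset, the attribute names, and the notes are all invented for the example.

```python
import csv
import io

# Hypothetical data dictionary for an invented survey dataset.
# Each row describes one attribute (column) of the dataset.
DATA_DICTIONARY = [
    {"attribute": "participant_id", "required": "yes", "type": "integer",
     "notes": "Assigned sequentially at enrolment."},
    {"attribute": "visit_date", "required": "yes", "type": "date (YYYY-MM-DD)",
     "notes": "Date of the interview."},
    {"attribute": "bmi", "required": "no", "type": "decimal",
     "notes": "Calculated as weight (kg) divided by height (m) squared."},
]


def to_csv(rows):
    """Return the data dictionary as CSV text, ready to save with the data."""
    buffer = io.StringIO()
    writer = csv.DictWriter(
        buffer, fieldnames=["attribute", "required", "type", "notes"]
    )
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()
```

Calling to_csv(DATA_DICTIONARY) gives one header row followed by one row per attribute, which can be written to a file and shared with the dataset.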

Data management plan (DMP)

A data management plan, or DMP, is a formal document which outlines how a project will manage its research data throughout the whole project life cycle. This covers details of the type of data involved, how data will be gathered, stored, and backed up, how it will be accessed by collaborators in a secure way, and how any legal or ethical requirements will be met. It should also address the longer-term questions of preservation and sharing.

Data package (ORA)

ORA employs the concept of ‘packages’. This means that all the elements that comprise a dataset are stored together as one item. In ORA, a package might comprise a number of spreadsheets, some images, documentation, a README file describing the data and data collection methodology, the licence associated with the data, and the metadata describing the digital object.

Data snapshot

A copy of a dataset as it was at a particular point in time. If a dataset is constantly evolving, it may be helpful to take regular snapshots for backup purposes, or to allow the history of the data to be traced. Some types of data, such as those generated by longitudinal studies, may take decades to reach a stable form (or may never do so), and hence taking periodic snapshots allows the data to be archived, shared, and cited.

Database

A database is a structured set of data, accessed via a database management system (DBMS). The goal is to make it possible for the data to be easily queried, allowing users to locate a particular piece of information, or to answer more general questions (such as 'How many items of type x are recorded here?', or 'How many items of type x also have property y?').

There are various types of database, providing different ways of structuring information. The most common type is a relational database, which consists of a set of connected tables which record the properties of, and the relationships between, entities of various types. Microsoft Access and MySQL are examples of relational database management systems. XML databases, such as eXist, are designed to work with information that has been tagged with XML. Document-orientated databases (a type of NoSQL database) are flexible systems which do not require the attributes of objects to have a consistent structure. MongoDB is an example of a document-orientated database.
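
The kind of question a database is designed to answer ('How many items of type x also have property y?') can be sketched with SQLite, a small relational database engine included with Python. The table, column names, and rows below are invented purely for illustration.

```python
import sqlite3

# Invented example data: a tiny catalogue of items, each with a type
# and a yes/no property (whether the item is illustrated).
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE items (id INTEGER PRIMARY KEY, type TEXT, illustrated INTEGER)"
)
conn.executemany(
    "INSERT INTO items (type, illustrated) VALUES (?, ?)",
    [("letter", 0), ("letter", 1), ("map", 1), ("letter", 1)],
)


def count_items(item_type, illustrated=None):
    """Count items of a given type, optionally restricted by the property."""
    query = "SELECT COUNT(*) FROM items WHERE type = ?"
    params = [item_type]
    if illustrated is not None:
        query += " AND illustrated = ?"
        params.append(int(illustrated))
    return conn.execute(query, params).fetchone()[0]
```

With the rows above, count_items("letter") returns 3, and count_items("letter", illustrated=True) returns 2.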

Databases are useful tools for supporting research, as they enable researchers to model and organise complex datasets, and then to interrogate the data in a range of ways. Research Data Oxford can advise on database use, and the IT Learning Centre offers regular training on working with databases.

Dataset

A general term often used to describe a collection of research data. A digital dataset might comprise a single item such as a spreadsheet of numerical data, or it might be much larger, comprising a collection of related items such as spreadsheets, images, the readings on a particular day from a scientific instrument, or a mixture of these and many other types of data. See also data package (ORA).

Digital object

A digital object is a specific digital ‘thing’. It may comprise a single file, such as a research publication, with its associated metadata, or it may be a package containing multiple files and metadata.

Digital objects are frequently assigned identifiers, which distinguish them from other similar objects, and can be used for citation purposes. For example, each digital object in ORA, no matter what type, is assigned a UUID, or Universally Unique Identifier. Digital Object Identifiers, or DOIs, are also assigned to many research data items.

Digital Object Identifier (DOI)

A DOI is a particular type of persistent identifier assigned to digital items such as articles or datasets, to enable them to be located and cited. It is standardised by the International Organization for Standardization (ISO).

DOIs can be incorporated into URLs so that users can always access the digital content, even if it has moved to a different online location. If the content is unavailable, the DOI should still resolve to a record for the item. Publishers use DOIs to identify articles: for example, the DOI 10.1103/PhysRevLett.107.133902 is incorporated into the publisher’s URL: http://link.aps.org/doi/10.1103/PhysRevLett.107.133902. An item can always be traced using the DOI by using it with the prefix http://dx.doi.org/ (in this case, giving the URL http://dx.doi.org/10.1103/PhysRevLett.107.133902).
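
Turning a bare DOI into a resolvable URL is a matter of adding the prefix, as the following minimal Python sketch shows. The function name is hypothetical, and the check on the leading '10.' reflects the DOIs in current use rather than a full validation of the DOI syntax.

```python
from urllib.parse import quote

# Resolver prefix as quoted in the text above.
RESOLVER_PREFIX = "http://dx.doi.org/"


def doi_to_url(doi):
    """Build a resolver URL from a bare DOI (hypothetical helper)."""
    doi = doi.strip()
    # Assumption: DOIs in current use begin with the characters '10.'.
    if not doi.startswith("10."):
        raise ValueError("expected a DOI beginning with '10.'")
    # Percent-encode any characters that are not safe in a URL path,
    # keeping the slash that separates the DOI prefix from the suffix.
    return RESOLVER_PREFIX + quote(doi, safe="/")
```

For the DOI in the example above, doi_to_url("10.1103/PhysRevLett.107.133902") returns http://dx.doi.org/10.1103/PhysRevLett.107.133902.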

The Bodleian Libraries holds a contract on behalf of the University that permits the Libraries to assign DOIs to Oxford research data items (see the Bodleian DOI policy for more details). However, this does not extend to publication-like items such as articles or books.

Digital preservation

The process of storing the bits and bytes that comprise digital objects. Preservation does not necessarily imply continued access.

Preservation is an important part of prolonging the life of research data, but is not sufficient by itself. Simply storing data files without actively managing them can result in data which still exists, but which is unusable. For example, data integrity checks may not have been carried out, bit-rot (data decay) may have made the data unusable, or the software needed to open the files may no longer be available.
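
One common form of data integrity check is to record a checksum for each file when it is stored, and to recompute it later to confirm that the file has not changed. The Python sketch below is one possible approach, using the SHA-256 algorithm from the standard library; the function names are hypothetical, and a repository may use different tools or algorithms.

```python
import hashlib


def sha256_of(path, chunk_size=1024 * 1024):
    """Return the SHA-256 checksum of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        # Read fixed-size chunks so that large files need not fit in memory.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def is_unchanged(path, recorded_checksum):
    """True if the file's current checksum matches the one recorded earlier."""
    return sha256_of(path) == recorded_checksum
```

If is_unchanged returns False, the file differs from the copy that was originally stored, and should be restored from another copy.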

Document management system (DMS)

A computer system to enable efficient management of large quantities of documents whilst they are in active use. DMSs can be used to manage the creation, storage, editing, version control, and disposal of documents. Such systems are usually accessible by multiple users (sometimes with different permission levels), and so can facilitate sharing of research materials.

SharePoint is an example of a popular document management system available to research groups at the University of Oxford.

Documentation

Contextual information provided with data to enable users to make sense of it and to interpret it properly. Documentation may relate to a whole dataset (e.g. a README file that accompanies the data files, or a detailed description of data gathering methods), or to specific aspects of it (e.g. labelling of columns in a spreadsheet, or annotation of apparent anomalies in the data).

Embargo

If a dataset deposited in an archive has an embargo placed on it, it means that the dataset is not accessible. Typically, there will be a metadata record describing the data, but the data itself will not be available. Embargoes may be permanent, or for a fixed period of time. Researchers may sometimes choose to deposit a dataset at the end of their project, but to embargo it for a further period - for example, until publications which make use of the data have appeared.

FAIR data

Data which meets a set of principles for data management and stewardship established by a consortium of scientists, and endorsed by the G20 Hangzhou summit in 2016. FAIR is an acronym, standing for Findable, Accessible, Interoperable and Reusable. You can find more details in the original paper outlining the principles, or on the GO-FAIR website.

HFS Backup

A backup service provided by IT Services. The HFS backup service is available to University of Oxford staff and postgraduates running Windows, macOS, and Linux.

Further information about the service is available from the HFS web pages.

Licence

A statement about an item (such as a creative work or a dataset) which indicates what potential users may and may not do with it. Some licences are custom-written formal legal contracts which need to be signed by both the owner of the item and the reuser. Others are open licences, which grant reuse rights to anyone, sometimes subject to conditions such as attribution of the data creator or a requirement that any derivative works are made available with a similar open licence. Creative Commons and Open Data Commons are examples of open licences.

Linked data

Structured digital data which is connected to other digital data, often using common web technologies such as HTTP, RDF, and URIs. Linked data is designed not just to be comprehensible to humans, but also to make information machine readable. For example, a sentence such as 'Charles Dickens wrote Oliver Twist' might be represented as "'Charles Dickens' isAuthorOf 'Oliver Twist'", with each element of the statement being given a unique machine-readable identifier that points the machine (and the reader) to a page that clearly explains who or what they are - thus making it easy to discover that Charles Dickens was a person, while Oliver Twist is a novel.
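
The Dickens example can be sketched as a set of subject, predicate, object statements (often called triples). The Python below is illustrative only: the example.org identifiers are placeholders rather than real records, and a real project would normally use an RDF library and established vocabularies.

```python
# Hypothetical identifiers: the example.org URIs are placeholders.
DICKENS = "http://example.org/person/charles-dickens"
OLIVER_TWIST = "http://example.org/work/oliver-twist"
IS_AUTHOR_OF = "http://example.org/vocab/isAuthorOf"
# The standard RDF property used to state what kind of thing something is.
RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"

# Each statement is a (subject, predicate, object) triple.
TRIPLES = [
    (DICKENS, IS_AUTHOR_OF, OLIVER_TWIST),
    (DICKENS, RDF_TYPE, "http://example.org/vocab/Person"),
    (OLIVER_TWIST, RDF_TYPE, "http://example.org/vocab/Novel"),
]


def objects(subject, predicate):
    """Return every object linked to a subject by a given predicate."""
    return [o for s, p, o in TRIPLES if s == subject and p == predicate]
```

Here objects(DICKENS, IS_AUTHOR_OF) returns the identifier for Oliver Twist, and objects(OLIVER_TWIST, RDF_TYPE) shows that the identifier refers to a novel.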

Linked data is increasingly being used by groups such as the BBC, The Guardian, and the UK Government, and underpins the notion of the Semantic Web. Tim Berners-Lee states that 'The Semantic Web isn't just about putting data on the web. It is about making links, so that a person or machine can explore the web of data. With linked data, when you have some of it, you can find other, related, data.'

The Bodleian Libraries publishes much metadata as linked data.

Live data

Data that is currently being worked on as part of a research project. The files containing the data will be regularly accessed, and may be amended or updated as new data is gathered or processed.

A snapshot of live data can be archived to create a stable, citable version.

Metadata

Literally, data about data - for example, data that describes an item such as a dataset.

The term is sometimes used interchangeably with documentation, but often means information with a defined structure, and which is designed to be machine readable. A metadata schema or standard defines a set of pieces of information to be recorded about an object in a consistent way. For example, metadata for a research dataset might include fields for the author or creator of the item, the title, the date of creation or publication, the publisher, a unique identifier, and so on. The type of metadata it makes sense to record may also be more specific to the type of data: for example, metadata for a digital photograph file might include information about the light conditions, lens, and location of the camera when the image was taken.
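
A structured metadata record of this kind might look like the following Python sketch. The field names follow the examples in the paragraph above, but the list of required fields, the values, and the helper function are invented for illustration and do not represent any particular metadata standard.

```python
import json

# Hypothetical list of fields that every record is expected to contain.
REQUIRED_FIELDS = ("creator", "title", "date", "publisher", "identifier")

# Invented example record for an imaginary dataset.
record = {
    "creator": "A. Researcher",
    "title": "Example survey responses",
    "date": "2024-01-31",
    "publisher": "Example University",
    "identifier": "example-dataset-001",
}


def missing_fields(metadata):
    """List the required fields that are absent or empty in a record."""
    return [name for name in REQUIRED_FIELDS if not metadata.get(name)]


# Because the record has a defined structure, it can be written out in a
# machine-readable format such as JSON.
record_as_json = json.dumps(record, indent=2)
```

Recording the same fields in the same way for every dataset is what allows software to check, search, and exchange the records.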

Having metadata associated with a dataset helps the dataset to be found and cited, and provides other researchers with the information they require to understand the data. The Bodleian Libraries can offer expert advice regarding metadata: contact Research Data Oxford for further information.

Open access

Open access research outputs are those which are made freely available online, without access charges or other barriers. A major focus of the open access movement has been academic literature such as journal articles, but a wide range of research materials - including research data - are available on an open access basis.

ORA

ORA is the University of Oxford's repository for research outputs, including data, and is managed by the Bodleian Libraries. The service archives, preserves, and enables the discovery and sharing of data produced by Oxford researchers. Any type of digital research data, from across all academic disciplines and in all formats, may be deposited in ORA (apart from non-anonymised personal or confidential data). Each dataset deposited in ORA has a metadata record describing it, and DOIs can be assigned if desired. Datasets can be embargoed if required.

ORA is aimed at researchers who need somewhere to deposit research data (especially data that underpins publications, and data where the funding body requires archiving and preservation), and also at researchers who are depositing their data elsewhere (in a discipline-specific archive, for example) but who wish to include an entry for their dataset in the University's catalogue of research data.

Until 2023, the part of ORA which dealt with datasets was known as ORA-Data. However, following an integration of the deposit processes, use of the separate ORA-Data label is being phased out.

More information about ORA is available from the LibGuide and the deposit interface.

Personal data

Any data about living, identifiable individuals. Personal data must be handled in accordance with the relevant legislation, including the UK General Data Protection Regulation (GDPR). The University's Compliance website offers extensive guidance on handling personal data.

Preservation

See digital preservation.

Repository

A service in which research data can be deposited and preserved safely for the long term. Data in repositories may be open access, or restricted. The term is frequently used interchangeably with archive.

Reproducible research

Research that is conducted in a way designed to allow the same analysis to be repeated multiple times, with the same results. This often means that rather than simply publishing a paper containing the conclusions of the research, the materials that allowed those conclusions to be reached are also made available. Typical examples include compendia of data, code and text files, often organised around an R Markdown source document or a Jupyter notebook. This allows subsequent researchers to retrace the steps taken, validate the results, and then to build new work on this foundation.

Research data

The Digital Curation Centre defines research data as 'Representations of observations, objects, or other entities used as evidence of phenomena for the purposes of research or scholarship'. Research data can take many different forms, depending on the field of study: it may be numerical, textual, consisting of images or audio-visual data, or it may be something else entirely. Some research data is highly structured (e.g. tabular data); some is unstructured; some is somewhere in between.

The University of Oxford Research Data Management Policy is concerned with a particular subset of research data, namely 'the recorded information (regardless of the form or the media in which it may exist) necessary to support or validate a research project’s observations, findings or outputs, or which is required for legal or regulatory compliance'.

Research data management (RDM)

The process of managing research data and the services and policies that support these activities. RDM is an umbrella term, covering a range of activities which stretch throughout the research life cycle. It includes (but is not limited to) data management planning, collection, organisation, storage, security, backup, documentation, compliance with relevant legal or ethical requirements, preservation, and sharing.

Good RDM is a critical element of research in all disciplines. Many funding bodies now have data management policies, which may impose certain obligations (e.g. creating a data management plan, or archiving data at the end of a project) on researchers.

Research information

Information about research, as opposed to data produced as a product of research. Research information can include details of the people, places, funders, activities, and other entities that form part of the research process. It describes who does what research, with whom, where, with what outputs, and funded by whom.

Research information management (RIM)

The activity of managing research information. It is the process of keeping information about research current and making sure that those who need access to the data are able to obtain it. This might be achieved by using systems such as finance or human resources databases, or specialist tools such as Symplectic Elements. From an information manager’s perspective, good information management is best achieved by having a single source that can be kept up to date and accurate: all other systems and users then obtain the data from this single authorised source. This approach also removes the need to enter the same information more than once.

Sharing

In the context of research data, sharing can mean two related but distinct activities. First, it may refer to providing multiple people within a project team with access to live data, as part of the active phase of a research project. Secondly, it may refer to making stable data more generally available for reuse (though not necessarily with no restrictions at all), often after the project concludes. Sharing of the latter sort is often done via an archive.

Snapshot

See data snapshot.

Stable data

A term used to refer to a completed dataset, which is no longer being added to or edited. Stable data may be archived at the end of a project, and (where appropriate) shared for reuse by others.

Version

A dataset as it is at a particular point in time. Datasets frequently evolve through a project: they grow as more data is gathered, and they change as data is edited, processed, and manipulated. Thus there may be many different versions of one dataset.

Version control

The process of tracking and managing different versions of a dataset. In straightforward cases, this may be done simply by labelling each version appropriately. A numerical tag may be attached to a dataset (e.g. in the file name or metadata) and changed each time it is updated, or the label may specify the type of version - for example, raw data as collected, processed data, or a subset of anonymised data generated as supplementary data for an article. More complex projects may benefit from the use of specialist version control software, or a document management system which includes version control features.
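
For the straightforward case of a numerical tag in the file name, the labelling can be automated. The Python sketch below assumes a hypothetical naming convention of the form name_vN.extension (for example, survey_v3.csv); it is one possible convention, not a recommendation of any specific scheme.

```python
import re

# Hypothetical naming convention: <name>_v<number>.<extension>
VERSION_PATTERN = re.compile(
    r"^(?P<stem>.+)_v(?P<number>\d+)(?P<suffix>\.[^.]+)$"
)


def next_version(filename):
    """Return the file name to use for the next version of a dataset."""
    match = VERSION_PATTERN.match(filename)
    if match is None:
        raise ValueError(
            f"{filename!r} does not follow the <name>_v<number>.<ext> pattern"
        )
    number = int(match.group("number")) + 1
    return f"{match.group('stem')}_v{number}{match.group('suffix')}"
```

For example, next_version("survey_v3.csv") returns "survey_v4.csv". Keeping earlier versions rather than overwriting them allows changes to be traced later.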

ORA can accept updated versions of datasets: the metadata record remains mostly the same, but a new version number is included. However, at present it cannot offer versioned DOIs.

Warm data

This term is used to refer to datasets which are neither 'hot' (i.e. live) nor 'cold' (i.e. stable), but somewhere in between. For example, a research project may produce an output such as an online database. After the end of the project, the work of actively developing the database may cease (perhaps apart from occasional corrections or minor updates, as resources permit), but the database remains useful to researchers from the original project and beyond - perhaps even to a large community of users around the world. This is particularly common in humanities disciplines, where a database (of, for example, historical information) may still be widely used decades after it was first created.

Having the database hosted online, with a custom search interface, makes it much more easily accessible than a copy in an archive, which each individual reuser would need to download, set up on their own machine, and learn how to use. However, ongoing hosting of this sort can be difficult to sustain: funding bodies will generally not cover costs incurred after the formal end of the project, and the members of the project team are likely to have moved on to new endeavours. Services designed with warm data in mind - such as Oxford's Sustainable Digital Scholarship service - aim to meet this need by providing affordable long-term solutions.

XML (Extensible Markup Language)

XML is a markup language that renders elements in a document machine-readable, enabling computers to analyse and perform operations on texts. XML works by tagging particular characters, words, or passages in a text, so that a computer knows that they have certain characteristics, or belong to a certain set, and should therefore be processed in a particular way when instructed. For instance, if a particular word is a name, it can be enclosed in tags that indicate the start and the end of the name, e.g. <name>Henry</name>. XML is very useful when analysing large corpora of texts.
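
Once names have been tagged in this way, a program can pick them out. The following Python sketch uses the standard library's XML parser; the sample sentence and the enclosing letter tag are invented for illustration.

```python
import xml.etree.ElementTree as ET

# Invented sample text; only the <name> tag comes from the example above.
SAMPLE = (
    "<letter>Dear <name>Henry</name>, "
    "I met <name>Eleanor</name> in Oxford.</letter>"
)


def names_in(xml_text):
    """Return the text of every <name> element in a snippet of XML."""
    root = ET.fromstring(xml_text)
    return [element.text for element in root.iter("name")]
```

Here names_in(SAMPLE) returns the list ["Henry", "Eleanor"]. The same approach scales to a large corpus, where reading every text by hand would be impractical.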

Metadata about datasets is also often expressed as XML, to ensure that machines can interpret contextual information about data according to a particular standard.

TEI XML is a standardised vocabulary of XML tags that is frequently used by researchers working with texts. For further guidance, contact Research Data Oxford.