Make your data count

Research data glossary

The list explains some of the terms used to describe storage and management of digital research data. Some terms are interchangeable with common usage, and some have specific meanings in particular contexts. The terms below are explained as they are used within the Oxford RDM services supplied by the Bodleian Digital Libraries Systems & Services (BDLSS), IT Services, and Research Services.

Archive

A service to record, organise, and store (digital) items in optimal conditions, with standardised labelling to ensure their longevity and continued access. The service is based on application of metadata, archiving policies, records management, and digital preservation actions. In the case of Bodleian digital archiving it also includes web-processing. Archivists make decisions on selection and retention of items which are usually governed by supporting policies. See also Digital Preservation and Curation. ORA-Data is the University of Oxford’s research data archive.

Backup

A copy of the digital data to be stored and used as a replacement in case the main copy is either deleted or corrupted. A backup service does not provide the same service as an archive, i.e. it does not provide for access by data consumers or individuals other than the data owner or IT support. See also HFS.

Content management system (CMS)

A computer system for enabling multiple users to share, edit, and publish content, usually on the Web. A CMS might underpin a website enabling many people who have been granted permission to add and edit content. WordPress and Drupal are examples of popular content management systems. There is increasingly an overlap between the capabilities of content management systems and document management systems (q.v.).

Curation

Curation is the act of managing digital items held within an archive over the long term. It is an active process, implying action on the part of the curators so that items remain secure, discoverable and accessible. ‘Digital curation involves maintaining, preserving and adding value’ to archived items ‘throughout their lifecycle. The active management of [digital] research data reduces threats to their long-term research value and mitigates the risk of digital obsolescence.’[1] Curation includes selection, appraisal, preservation, disposal and transformation (for example, migration to an updated format).

DataBank [Service]

Former name of the Bodleian digital research data archive superseded by ORA-Data.

Database

A database is a structured set of data, accessed via a database management system. For example, personal data might be organised and labelled as ‘Given name,’ ‘Family name,’ ‘Date of Birth,’ ‘Colour of eyes’ and so on. The data can then be queried to answer questions such as, “How many people within a particular set have blue eyes?” There are various types of database, providing different ways of structuring information. The most common type is a relational database, which expresses both the properties of, and the relationships between, different instances of particular objects. Microsoft Access and MySQL are examples of relational database management systems. XML databases, such as eXist, are designed to work with information that has been ‘tagged’ with XML (see XML). ‘Document-orientated’ (or ‘NoSQL’) databases are flexible databases in which the elements of a particular object are not required to be consistent. MongoDB is an example of a document-orientated database.
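
The eye-colour question above can be sketched as a query against a small relational table. This is only an illustration: it uses Python's built-in sqlite3 module rather than any Oxford service, and the table name, column names and sample people are all invented for the example.

```python
import sqlite3


def count_by_eye_colour(rows, colour):
    """Count the people in `rows` whose eye colour matches `colour`.

    `rows` holds (given_name, family_name, date_of_birth, eye_colour) tuples.
    The table and column names are assumptions made for this sketch.
    """
    conn = sqlite3.connect(":memory:")  # temporary database held in memory
    try:
        conn.execute(
            "CREATE TABLE person ("
            "given_name TEXT, family_name TEXT, "
            "date_of_birth TEXT, eye_colour TEXT)"
        )
        conn.executemany("INSERT INTO person VALUES (?, ?, ?, ?)", rows)
        (count,) = conn.execute(
            "SELECT COUNT(*) FROM person WHERE eye_colour = ?", (colour,)
        ).fetchone()
        return count
    finally:
        conn.close()


# Invented sample data, not real people.
PEOPLE = [
    ("Ada", "Example", "1990-01-01", "blue"),
    ("Ben", "Sample", "1985-06-15", "brown"),
    ("Cy", "Specimen", "1978-11-30", "blue"),
]

print(count_by_eye_colour(PEOPLE, "blue"))  # prints 2
```

Because every row has the same labelled columns, the same query works unchanged however many people are added.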

Databases are useful tools for supporting research, as they enable researchers to structure data and then rapidly query it in a consistent manner. Advice regarding databases is available from the IT Services Research Support Group and BDLSS (Bodleian Digital Libraries Systems & Services). Enquiries sent to researchdata@ox.ac.uk will be assigned to the group best positioned to help. The IT Learning Programme (ITLP) offers courses on databases.

Dataset

A general term often used to describe a collection of research data. A digital dataset might comprise a single element such as a spreadsheet of numerical data; it could equally comprise a collection of related elements such as spreadsheets, images or the readings on a particular day from a scientific instrument or a mixture of these. See also data package.

DataStage [Prototype]

DataStage is software developed as part of the JISC-funded DataFlow project to help researchers manage their ‘active’ digital research data prior to publication or archiving. It is a means for researchers to deposit selected data into their data repository of choice (providing that repository complies with the SWORD2 standard). DataStage is a secure personalized ‘local’ file management environment for use at the research group or individual level, appearing as a mapped drive on the researcher’s computer. It can be deployed on a local server, or on an institutional or commercial cloud. Users save files to DataStage just as they would on an ordinary C: drive, but with added extras:

  • Private, shared and collaborative directories, with password-controlled access
  • Web access – work securely with stored files over the web, anywhere in the world
  • Users can add richer metadata via the web interface, using free-text ‘notes’ fields
  • All files can be automatically backed up via the usual backup service
  • Users can invite colleagues to access files made available to a defined group via password control
  • Repository submission interface makes it easy for researchers to define data packages, enter minimal metadata, and deposit them in a data archive of choice
  • Flexibility to dynamically invoke additional cloud storage as required

http://www.dataflow.ox.ac.uk/index.php/about/about-datastage

Digital Object

A digital object is a digital ‘thing’ which may comprise a single file and its associated metadata. A digital object could be a ‘person’ object, ‘project/award’ object or ‘organisation’ (such as funder), or ‘publication’ object. It might also be a ‘package’ containing multiple files and metadata. Each digital object in ORA-Data, no matter what type of object, is assigned a unique identifier (a UUID, or Universally Unique Identifier). Whereas a UUID is assigned to every item, DOIs are assigned to many research data items.
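
As a rough illustration of what a UUID looks like, the sketch below generates one with Python's standard uuid module. It assumes a random (version 4) UUID; the glossary does not say which UUID variant ORA-Data uses or how it generates them, and the function name is hypothetical.

```python
import uuid


def new_object_id():
    """Return a freshly generated random (version 4) UUID as a string."""
    return str(uuid.uuid4())


object_id = new_object_id()
print(object_id)  # 36 characters: 32 hexadecimal digits and 4 hyphens
```

The value of such an identifier is that it can be created without consulting a central register, yet is unlikely in practice ever to collide with another.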

Digital Object Identifier (DOI)

A DOI is a persistent identifier that is usually assigned to a digital item such as an article or a dataset in order that the item can be found and cited. DOIs can be incorporated into URLs so that users can always access the digital content, even if it has moved online location. If the content is unavailable, the DOI should still resolve to a record for the item. Publishers use DOIs to identify articles, e.g. the DOI 10.1103/PhysRevLett.107.133902 is incorporated into the publisher’s URL http://link.aps.org/doi/10.1103/PhysRevLett.107.133902. For a more persistent link, the item can always be traced by adding the prefix http://dx.doi.org/ to the DOI, as in http://dx.doi.org/10.1103/PhysRevLett.107.133902. The Bodleian Libraries hold a contract on behalf of the University that permits the Libraries to assign DOIs to Oxford research data items. The Libraries are not permitted to assign DOIs to ‘book-like’ publications such as articles or books.
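
A minimal sketch of turning a DOI into a resolvable link, using the http://dx.doi.org/ prefix given in this entry. The function name and the simple validity check are assumptions made for the example, not part of any official DOI tooling.

```python
from urllib.parse import quote

RESOLVER_PREFIX = "http://dx.doi.org/"  # prefix quoted in the glossary entry


def doi_to_url(doi):
    """Build a resolver URL from a bare DOI such as 10.1103/PhysRevLett.107.133902."""
    doi = doi.strip()
    # Simplified check (an assumption): DOIs start with "10." and have a
    # slash separating the prefix from the suffix.
    if not doi.startswith("10.") or "/" not in doi:
        raise ValueError(f"not a DOI: {doi!r}")
    # Percent-encode any characters that are unsafe in a URL, keeping slashes.
    return RESOLVER_PREFIX + quote(doi, safe="/")


print(doi_to_url("10.1103/PhysRevLett.107.133902"))
# prints http://dx.doi.org/10.1103/PhysRevLett.107.133902
```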

Digital Preservation

The process of storing the bits and bytes that comprise digital objects. Preservation does not necessarily imply continued access. Keeping the bits and bytes comprising data safely without actively managing them can result in data which still exist, but which are unusable. For example, data integrity checks may not have been carried out, and bit-rot (data decay) may have rendered the data unusable. See also Curation.
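
One common form of data integrity check is to record a checksum when data is stored and to recompute it later. The sketch below assumes SHA-256 from Python's standard hashlib module; the glossary does not state which checks the Oxford services actually run, and the sample data and function names are invented.

```python
import hashlib


def sha256_of(data):
    """Return the SHA-256 checksum of `data` (bytes) as a hexadecimal string."""
    return hashlib.sha256(data).hexdigest()


def is_intact(data, recorded_checksum):
    """True if `data` still matches the checksum recorded when it was stored."""
    return sha256_of(data) == recorded_checksum


original = b"temperature,12.5\n"  # invented sample content
recorded = sha256_of(original)    # checksum noted at the time of deposit

# Simulate bit-rot by flipping a single bit in the first byte.
damaged = bytes([original[0] ^ 0x01]) + original[1:]

print(is_intact(original, recorded))  # prints True
print(is_intact(damaged, recorded))   # prints False
```

A check like this only detects damage; recovering the data still depends on having a sound copy elsewhere.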

Document management system

A computer system to enable efficient management of large quantities of documents whilst they are in active use and editing. Such systems are usually accessible by many permitted users. They are used to manage the creation, storage, version control and disposal of documents. They make sharing of documents easy during the active life of the document. SharePoint is an example of a popular document management system available to research groups at the University of Oxford.

HFS [Hierarchical File Server, IT Services]

A backup service provided by IT Services. The HFS backup service is available to Oxford University staff, senior members, and postgraduates running Windows, Mac OS X, Linux and Unix. Servers may also be registered for back-up. Back-ups may be triggered manually or left to run automatically overnight. Three copies are made of each back-up, two of which are held in fire-proof safes. Over two petabytes of data are contained in the HFS system. The service is chargeable to users who wish to back up more than 4 TB of data.

Whilst the HFS does offer a long-term storage service, neither this nor the regular back-ups include any descriptive metadata about the data they back up, nor is the data accessible to anyone other than the depositor and IT Services. Further information about the service is available from http://www.oucs.ox.ac.uk/hfs/.

See also Backup

Linked data

The Bodleian Libraries publish much metadata as linked data. This means that related data can be linked together with a meaningful description of the relationship so that it is possible to find other related data using machines and automated processes. For example, as a human being I can understand the phrase “Charles Dickens wrote ‘Oliver Twist.’” In machine-speak, it might be phrased ‘Charles Dickens’ ‘isAuthorOf’ ‘Oliver Twist’. This would be fully expressed by each element having a unique machine-readable identifier that points the machine (and the reader) to a page that clearly explains what (who) Charles Dickens is. Linked data is being increasingly used by groups such as the BBC, The Guardian and the UK Government, and underpins the notion of the ‘Semantic Web’. Tim Berners-Lee states that “The Semantic Web isn’t just about putting data on the web. It is about making links, so that a person or machine can explore the web of data. With linked data, when you have some of it, you can find other, related, data.”[2]
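
The ‘isAuthorOf’ statement can be sketched as a subject, relationship and object, each given its own identifier. The example below uses plain Python tuples rather than a linked data library, and every identifier under http://example.org/ is a made-up placeholder, not a real Bodleian URI.

```python
# Hypothetical identifiers, invented for this sketch.
DICKENS = "http://example.org/person/charles-dickens"
IS_AUTHOR_OF = "http://example.org/vocab/isAuthorOf"
OLIVER_TWIST = "http://example.org/work/oliver-twist"
GREAT_EXPECTATIONS = "http://example.org/work/great-expectations"

# Each statement is a (subject, relationship, object) triple.
TRIPLES = [
    (DICKENS, IS_AUTHOR_OF, OLIVER_TWIST),
    (DICKENS, IS_AUTHOR_OF, GREAT_EXPECTATIONS),
]


def objects_of(triples, subject, relationship):
    """Return every object linked to `subject` by `relationship`."""
    return [o for s, r, o in triples if s == subject and r == relationship]


for work in objects_of(TRIPLES, DICKENS, IS_AUTHOR_OF):
    print(work)
```

Starting from one identifier, a machine can follow the stated relationships to reach the related items, which is the ‘finding other, related, data’ that the quotation describes.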

‘Live’ data

Data that is being worked on as part of a research project. The files containing the data will need to be accessed and amended or updated as new data is gathered or processed. A snapshot of live data can be archived to create a version that is no longer worked on and which is stable and can be cited. Some datasets are never ‘finished’ (e.g. longitudinal studies).

Metadata

Data that describes an item such as a dataset. Metadata labels a dataset with descriptive information such as Author/Creator; Title; Date; Publisher; Unique identifier and so on. Having metadata associated with a dataset enables the dataset to be found and cited. It provides other researchers with the information they require to understand the data. Metadata should comply with accepted international standards wherever possible. The Bodleian Libraries offer advice and expertise in metadata matters (email researchdata@ox.ac.uk).
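
A minimal sketch of a metadata record, using the descriptive labels listed above as field names. The field names, the choice of which fields are required, and all the sample values are assumptions for illustration; they do not follow any particular metadata standard or the ORA-Data schema.

```python
# Field names taken from the labels in this entry; treating them all as
# required is an assumption made for the sketch.
REQUIRED_FIELDS = ("creator", "title", "date", "publisher", "identifier")


def missing_fields(record):
    """Return the required fields that are absent or blank in `record`."""
    missing = []
    for field in REQUIRED_FIELDS:
        value = record.get(field)
        if value is None or not str(value).strip():
            missing.append(field)
    return missing


# Invented example record.
record = {
    "creator": "Example, A.",
    "title": "Sample river temperature readings",
    "date": "2015",
    "publisher": "",  # left blank to show what the check reports
    "identifier": "example-identifier-0001",
}

print(missing_fields(record))  # prints ['publisher']
```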

ORA-Data [pilot service]

ORA-Data is the University of Oxford’s digital catalogue and repository for research data, managed by the Bodleian Libraries. It offers a service to archive, preserve and enable the discovery and sharing of data produced by Oxford researchers. Any type of digital research data, from across all academic disciplines and in all formats, may be deposited in ORA-Data. DOIs are assigned to data deposited in ORA-Data, and each dataset has a metadata record describing the dataset. Datasets can be embargoed if required.

ORA-Data is aimed at researchers who need a repository to deposit research data (especially data that underpins publications, and data where the funding body requires archiving and preservation), and researchers who wish to include an entry for their dataset in the University’s catalogue of research data (irrespective of the location of the data). There may be a charge for depositing data, depending on the size of the dataset and the funding status of the researcher.

More information about ORA-Data is available in the LibGuide: http://ox.libguides.com/ora-data.

The ORA-Data pilot can be accessed via ‘Data’ at http://ora.ox.ac.uk/information/contribute.

ORDS

The Online Research Database Service (ORDS) was an online database management system operated by the University of Oxford’s IT Services between 2014 and 2017.

Data Package

ORA-Data employs the concept of ‘packages.’ This means that all the elements that comprise a dataset are stored together as one item. In ORA-Data, a package might comprise a number of spreadsheets, some images, documentation, a README file describing the data and data collection methodology, the licence associated with the data, and the metadata describing the digital object.

Preservation

See digital preservation.

Research Data

Research data and records are defined as the recorded information (regardless of the form or the media in which they may exist) necessary to support or validate a research project’s observations, findings or outputs (University of Oxford Research data management and open data policy [3]).

Research Data Management (RDM)

The process of managing research data and the services and policies that support these activities. Good RDM is a critical element of research in all disciplines, particularly in cases where research funders require researchers to manage their data. See the University of Oxford RDM website for details and sources of advice and information:  http://researchdata.ox.ac.uk/

Research Information

Information (data) ABOUT research, as opposed to data produced as a PRODUCT of research. Research information can include data which describe the people, places, funders, activities and other entities that form part of the research process. It might describe who does what research, with whom, where, funded by whom.

Research Information Management (RIM)

The activity of managing research information. It is the process of keeping information about research current and making sure that those who need access to the data are able to obtain it. This might be achieved by using systems such as Finance databases, Symplectic, or Human Resources databases. From an information manager’s perspective, good information management is best achieved by having a single source of the data that can be kept up to date and accurate, with all other systems and users obtaining the data from this single canonical source. This approach also removes the need to enter information more than once. An offshoot of good RIM is being able to use the data for business intelligence such as finding out who is doing research in a particular area, or to answer other business-critical queries.

Version

The version of a dataset can refer to either:

i) the numerical tag attached to a dataset as it is updated, or

ii) the type of version, for example raw data as collected, processed data, or a subset of anonymised data generated as supplementary data for an article.

ORA-Data accepts updated versions of datasets, i.e. where the record remains the same but where a new version number is indicated on that record and in the DOI. The most recent version is displayed by default.

Website

A website usually comprises two elements: i) the user interface that users see on their screen and with which they interact, and ii) the underlying data – although this may in fact be held elsewhere and pulled in by the website. When using ORA-Data, it is the underlying data that is archived: ORA-Data does not provide a service to host live websites.

XML (Extensible Mark-up Language)

XML is a language that renders elements in a document machine-readable, essentially enabling computers to analyse and perform operations on texts. XML works by tagging particular characters, words, or passages in a text, so that a computer knows that they have certain characteristics, or belong to a certain set, and should therefore be processed in a particular way when so instructed. A researcher may, for instance, indicate that a particular word is a name, by enclosing it with a tag that indicates the start of the name and another that indicates the end of the name, e.g. <name>Henry</name>. XML is very useful when analysing large corpora of texts.
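
To show how a computer can act on such tags, the sketch below extracts every tagged name from a short passage using Python's standard xml.etree.ElementTree module. The sample sentence and the enclosing <p> element are invented for the example; only the <name> tag comes from the text above.

```python
import xml.etree.ElementTree as ET

# Invented sample passage; <name> marks the start and end of each name.
TEXT = "<p><name>Henry</name> met <name>Anne</name> in London.</p>"


def tagged_names(xml_text):
    """Return the text of every <name> element, in document order."""
    root = ET.fromstring(xml_text)
    return [element.text for element in root.iter("name")]


print(tagged_names(TEXT))  # prints ['Henry', 'Anne']
```

Note that ‘London’ is not returned: the computer only treats as a name what has been explicitly tagged as one.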

Metadata about datasets is also often expressed as XML, to ensure that machines can interpret contextual information about data according to a particular standard.

TEI XML is a standardised vocabulary of XML tags that is frequently used by researchers working with texts. Advice relating to XML is available from the IT Services Research Support Group (research@it.ox.ac.uk) and from the Bodleian Libraries Text Technologies/TCP (Text Creation Partnership) team (digitalsupport@bodleian.ox.ac.uk).

 


[1] http://www.dcc.ac.uk/digital-curation/what-digital-curation

[2] http://www.w3.org/DesignIssues/LinkedData.html

[3] Available at http://researchdata.ox.ac.uk/