Data handling and acquisition

It's important to have good processes in place for collecting or creating your data, and for working with it during the active phase of your project. This will enable you to get the most out of your data, and will also help maximise its value for the longer term.

Data collection principles

Issues pertaining to specific types of data collection are discussed in the sub-sections below, but there are a few general key principles to consider:

Check for existing data first

Before you start planning your data collection, it's good practice to check whether any relevant pre-existing datasets are available to you. See the Data reuse section below for more on this.

Data collection needs to be consistent and accurate

Written protocols or standard processes can help with this. These are particularly useful in research groups where multiple people are involved in collection, but are valuable even for lone researchers. It's also important that appropriate quality assurance measures are in place. Depending on the type of data being collected, this might include calibration of instruments, validation of data entry, repeating samples or measurements, or peer review of data.

Data should be made as straightforward as possible to work with

Depending on the nature of the project, this may apply to the data collection process itself, or to what is done with the data immediately after collection. The progress of the rest of the project can be smoothed through (for example) clear documentation and labelling, standardised file names, and a well-ordered file or folder structure. See the later sections on this page for more on these.

Data collection must be ethical and compliant

The collection process and any interactions with research subjects need to be conducted ethically. It's also important to make sure you're aware of relevant laws and regulations, and that you have a plan to ensure compliance with these. See the next section for more on this.

Data from or about human participants

Any research activity involving human participants requires ethical approval, even if no personal data is being collected. For further information, see the University Research Ethics pages on the Research and Innovation intranet site.

Any personal data - that is, data about an identifiable living individual - needs to be handled in a way that is compliant with the General Data Protection Regulation (GDPR). Any University of Oxford activity which involves processing data will need to make use of the Data Protection by Design framework, which may involve completing a Data Protection Impact Assessment (DPIA).

If personal data is being collected directly from participants (e.g. via a survey or interview), a privacy notice or participant information sheet should be used to provide participants with all the relevant information about what will happen to their data: see the Creating Privacy Notices page on the Compliance website for more details. There is also specific guidance on running Surveys and Focus Groups.

Additional information is available from the Data Protection and Research pages on the Research and Innovation intranet site.

If people other than the members of the research team are involved in data collection or digitisation (if, for example, a transcription service is used), additional steps or processes may be needed. These might include gaining ethical approval for the involvement of the additional personnel, ensuring that the participant information sheet mentions that they will be involved in processing the data, or a Third Party Security Assessment for an external service. In some cases, a confidentiality agreement or data sharing agreement may be needed: Research Services can advise on this.

This topic is covered in more detail in the Ethical and legal issues section of this website.

Surveys

Surveys and questionnaires are a common method of collecting quantitative (and sometimes qualitative) data. They may be conducted via an online platform, via electronic devices such as tablets, or using pen and paper.

As there is a high chance that data collected via a survey will include personal data, it is important that proper processes are put in place, and appropriate tools are used: see the general notes on Data from or about human participants above for more on this. The University's Compliance team also has guidance on Conducting Compliant Surveys.

Online survey platforms

Online survey platforms can offer a quick and convenient way of creating and running a survey. Two platforms are available free at the point of use to all University members: Jisc Online Surveys, and Microsoft Forms. These have both been approved by the University's Information Security team as suitable places to store University data, including confidential data. They offer solid, user-friendly functionality, though they do not have the more advanced features offered by some survey tools.

Other tools offering a wider range of features are available to specific groups within the University: for example, the Medical Sciences Division runs an instance of REDCap for its members, and some departments have subscriptions to other survey platforms (for example Qualtrics) for the use of their members. Consult your local admin staff, research support staff, or IT officer to find out what is available to you.

If you need to use a survey tool other than those provided by the University, a Third Party Security Assessment (TPSA) should be completed. A handful of survey tools (including Qualtrics and SmartSurvey) have already been through this process, and are deemed suitable for use with confidential data. New subscriptions should be arranged through your department, rather than via a personal account, so that a contractual relationship exists between the University (which ultimately has legal responsibility for data gathered in the course of University research) and the service provider.

The IT Services Survey Advice Service can provide advice about selecting a suitable online platform, and training is available via the Digital Capabilities Team.

Electronic devices

If a reliable internet connection is not available, it may be more convenient to conduct a survey using an app on an electronic device, such as a tablet or a smartphone. This may be a standalone piece of software, or it may be an offline app that works in tandem with an online survey platform (responses are stored on the device, and then synced with the online platform when internet access is available).

Thought needs to be given to data storage and handling both on the device, and after data has been transferred elsewhere. If at all possible, portable devices used for storing personal data should be encrypted: if this is not feasible, data should be transferred from the device to secure storage at the earliest opportunity, and the original unencrypted copy deleted. If an app provided by an online survey platform is used, the platform should be selected in line with the guidance given above. Qualtrics and SmartSurvey are examples of suitably secure survey platforms which offer an offline app.

Hard copy surveys

In some cases, the traditional method of collecting survey responses using pen and paper may be most appropriate. Responses may then be digitised, either by scanning, or by entering the data manually into an electronic survey platform or database. As with electronic collection, care needs to be taken to ensure that the data is handled properly both during collection and subsequently. For example, it may be necessary to store completed survey forms in a locked filing cabinet, and any involvement of non-University personnel in collection or transcription may require a data sharing agreement.

Fieldwork

Some types of fieldwork raise additional data management challenges: for example, lack of reliable access to electricity or an internet connection may limit data collection and storage options.

In general:

Data collection and storage should be as secure as practically possible under the circumstances
Where compromises have to be made, data should be transferred to secure, properly backed-up storage as soon as feasible

Thus, for example, if fieldwork involves making recordings of interviews, the recording device should ideally be encrypted. If this is not possible, the recordings should be transferred to encrypted storage (and the unencrypted originals deleted) as soon as practically possible.

If a sufficiently robust internet connection is available, consider uploading data to secure cloud-based storage such as the University's Nexus365 OneDrive for Business at the earliest opportunity. This reduces the risk of data loss as a result of portable storage media being damaged, lost, or stolen while travelling. It is also possible to make back-up copies of hard copy data (e.g. consent forms or written questionnaires) by taking photos of these and uploading them to Nexus365 OneDrive. If portable media need to be used, these should be encrypted, and it is sensible to make multiple copies, and keep these in different places.

However, some countries impose restrictions on travelling with encrypted devices. This has the potential to cause problems at customs, and in extreme cases, can lead to devices being confiscated or to fines or other penalties. Check the situation in your destination before leaving the UK, so you can plan accordingly. If using an encrypted device may be problematic, uploading your data to Nexus365 OneDrive is a good option: this means that sensitive data does not need to be stored on the portable device, but can still be accessed easily once back in the UK.

Research that forms part of a University of Oxford project needs to comply with relevant UK legislation such as GDPR, even if collection is taking place in another country. You may also need to take local legislation into account.

Recording and transcribing interviews

Recording an interview saves you from having to take detailed notes, allowing you to concentrate on asking questions and responding to the interviewee's answers. Having a transcript of the interview means you can quickly search, browse, and refer to the material without having to listen to the whole recording.

You should always obtain explicit permission from the participants before recording any conversations. A video or audio recording will contain personally identifiable information, even if that is just the voice or face of the interviewee. As with all personal data, you need to make sure the recording is secure, for example by encrypting the file or only keeping it on secure, encrypted storage (see the Keeping working data safe section). Many recording devices, such as digital voice recorders or dictaphones, cannot be encrypted and cannot store recordings in an encrypted form. If alternative equipment is not available, you can mitigate the risk by copying each recording to your encrypted storage as soon as possible, deleting it from the device, and locking the device and its storage media away when not in use.

Another option is to record using Microsoft Stream, which uploads the recording directly to the University’s secure Nexus365 service. As Nexus365 has been approved for use with all kinds of University data, it is a good place to store any personal data that you're collecting. Nexus365 also gives you access to the transcription feature of Microsoft Stream or Word online, which you can use to automatically transcribe material at no cost. For more information, see our guide to transcribing interviews.

For remote video interviews, Microsoft Teams is the University's recommended platform. Participants do not need a Microsoft account: it is possible to send a link to a Teams meeting that will let them attend as a guest, using either a browser or the Teams app (which is free to download). Recording of Teams meetings can be enabled on request.

The Research Governance and Ethics FAQs include guidance for researchers working remotely with participant data.

Finding and reusing existing data

You may discover that the data you need has already been collected by an earlier project, and that you are able to reuse it, thus reducing duplication of effort. Alternatively, you may be able to build on a pre-existing dataset, or combine one with your own data to expand the scope of your project. It can also be helpful to see the approach taken by other researchers: this may provide pointers on the most appropriate collection and processing methods, data structure, or metadata standards.

Some research funders require you to state the steps taken to identify relevant datasets in your data management plan.

Ways of finding existing data

Visit the Bodleian Data Service web pages for information about finding, accessing, and using datasets for research
Browse or search data archives and repositories: Re3data and FAIRsharing both offer extensive catalogues of repositories in a wide range of subject areas
Search for datasets in the University of Oxford's institutional archive, ORA
Check recent publications in your area: a number of major funders (including the UK Research Councils) now require published results to include information about how to access the underlying data

Using shared data

Acquaint yourself with any terms of use or restrictions

Before making use of a dataset, it is important to make sure you are aware of any conditions imposed by the data creators. If the data has been made available under a licence, this may specify what can and cannot be done with the data.

As with data that you collect or generate yourself, there may also be ethical or legal restrictions that need to be respected - for example, personal data will need to be treated in accordance with GDPR. If data has been anonymised, it is not generally acceptable to attempt to reidentify the data subjects.

Make use of documentation

Datasets made available for reuse will normally be accompanied by documentation or other contextual information. It is worth taking some time to familiarise yourself with this: it will often include important details which will reduce the risk of misinterpreting the data.

Take care not to misrepresent

Having a thorough understanding of a dataset can also help you to avoid inadvertently misrepresenting its contents. For example, documentation accompanying survey data may provide information about how the survey sample was recruited, which in turn may help to flag up ways in which the sample may be unrepresentative, and thus to avoid drawing stronger conclusions than the data actually supports.

When combining multiple datasets, be alert to any differences in the way the data is presented which have the potential to lead to misunderstandings. For example, suppose you are dealing with two datasets which both record distances: one which rounds to the nearest kilometre, and one which gives an exact number of metres. If you convert the first set of measurements to metres, and then combine the two datasets, there is a risk of giving the impression that the data from the first dataset is claiming a greater degree of precision than is in fact the case. Annotation or documentation can help avoid confusion in cases of this kind.
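One way of annotating combined data is to carry a precision value alongside each measurement. The following is a minimal sketch in Python, assuming two small hypothetical datasets; the field names (distance_m, precision_m, source) are illustrative choices rather than an established standard.

```python
# Hypothetical records: dataset A rounds to the nearest kilometre,
# dataset B records an exact number of metres.
dataset_a_km = [3, 12, 7]
dataset_b_m = [2950, 12480]


def combine_distances(km_values, m_values):
    """Merge both sources into metres, noting the precision of each value."""
    combined = []
    for km in km_values:
        # Converted values keep a note that they are only good to 1000 m.
        combined.append({"distance_m": km * 1000, "precision_m": 1000, "source": "A"})
    for m in m_values:
        combined.append({"distance_m": m, "precision_m": 1, "source": "B"})
    return combined


records = combine_distances(dataset_a_km, dataset_b_m)
for record in records:
    print(record)
```

Keeping the precision in its own column means that anyone reusing the combined dataset can see that a value such as 3000 from the first source does not claim metre-level accuracy.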

Similarly, if you manipulate, cleanse, or otherwise process a dataset to make it suitable for your own purposes, it is important to keep careful records of what has been done, and to make it clear that your analysis is based on your processed version rather than the original.

Give credit where it's due

If you make use of someone else's dataset in the course of preparing a research output such as an article, a thesis, or a conference presentation, you should include proper citations, following whatever conventions are usual in your discipline. If a dataset is a significant source, it may also be appropriate to describe it and how you have used it in the body of the work, and/or to mention the data creators in your acknowledgements.

Even if data is in the public domain, it is still good academic practice to ensure it is properly cited - just as you would still cite a book or an article that you use as a source even if it is now out of copyright.

If you have questions about the reuse of third party material, you can contact the Bodleian's Copyright and Licensing Specialist.

How to cite a dataset

Citing a dataset is similar in many ways to citing a textual source such as a journal article or book. If the citation style in use within your discipline does not provide specific guidelines for data citations, you can adapt the style normally used to cite books. A data citation will typically include the following elements:

Author: The data creator or creators.

Title: The name of the dataset. If the dataset does not have a separate title of its own, give the name of the study the dataset is associated with, or a brief description of it. Where it is helpful to aid clarity, the word '[Dataset]' may be added immediately after the title.

Edition and/or version number: May not be applicable in all cases, but where multiple versions of a dataset exist, it is important to indicate which one is being referenced.

Publication date: The date that the dataset was made available. If the data is not generally available for reuse (e.g. data which has been shared privately), give the date when the dataset was finalised.

Publisher: This may be an institution or individual responsible for making the dataset available, or the repository which hosts it.

Location or identifier: If the dataset (or a metadata record for it) is available online, provide a link via which it can be accessed. Ideally, this will be a persistent identifier such as a DOI (digital object identifier); if this is not available, a URL may be given.
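As a rough illustration, the elements above could be assembled into a citation string as follows. This is a sketch only: the ordering and punctuation follow a generic author-date pattern rather than any particular citation style, and the authors, title, repository, and DOI shown are hypothetical placeholders.

```python
def format_dataset_citation(authors, year, title, publisher, identifier, version=None):
    """Assemble a data citation in a generic author-date style (illustrative only)."""
    parts = [f"{'; '.join(authors)} ({year}).", f"{title} [Dataset]."]
    if version:
        # Only include a version statement where multiple versions exist.
        parts.append(f"Version {version}.")
    parts.append(f"{publisher}.")
    parts.append(identifier)
    return " ".join(parts)


# All values below are placeholders, not a real dataset.
citation = format_dataset_citation(
    authors=["Example, A.", "Sample, B."],
    year=2023,
    title="Hypothetical river temperature survey",
    publisher="Example Data Repository",
    identifier="https://doi.org/10.0000/example",
    version="2",
)
print(citation)
```

In practice, the conventions of your discipline or publisher should take precedence over this layout.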

Sometimes you may wish to refer to a specific part of a dataset. There are fewer well-established conventions for doing this than for print media, but if a dataset is split into multiple files or sections, each with its own identifier, then these can be used to give a more precise citation. Where this is not the case, or is insufficient, it may be appropriate to include a note indicating how the relevant data can be located within the dataset. If you have created your own version of a dataset (e.g. by filtering or reorganising the data) and the reuse conditions permit it, you may wish to make a copy of this available (with the source of the data clearly indicated), to make your conclusions more easily reproducible.

How to Cite Datasets and Link to Publications is a guide from the DCC which provides further information on this topic; see also our round-up of Data citation guidance from around the web.

Copyright and intellectual property considerations for shared data

If you are working with data that you do not control the rights to, this may have an impact on whether you can make that data available at the end of your project. See the Planning for sharing section of the Sharing data page for more on third party data.

Preparing your data for analysis

Cleansing and formatting

Before starting to analyse your data, you'll want to make sure it's in the best possible shape. Data may need cleansing or formatting to make it as straightforward as possible to work with, and to reduce the risk of erroneous conclusions being drawn. Depending on the type of data, this may involve:

Correcting errors and (where possible) filling in missing data
Removing duplicates
Resolving or removing outliers
Ensuring null (missing) values are properly distinguished from zeros
Standardising formatting, spelling, or abbreviations
Ensuring data is consistently presented and labelled

Some types of data present special challenges. For example, in transcription of historical records, variant spellings may be an important aspect of the original data, but a standardised version may aid searching. In such cases, it may be appropriate to create a copy of the dataset containing both the original and the standardised versions.

If multiple datasets need to be combined (for example, because multiple people were involved in data collection), check these for consistency before merging them.

For a small dataset, it may be practical to cleanse and format data manually, but for larger ones, some degree of automation is likely to be necessary. Software packages used for data analysis often include tools to help cleanse data. Programming languages such as R and Python can also be used for this purpose.

As far as is feasible, ensure any steps you take are reversible. Save copies of the raw data files before cleansing or reformatting, so it is always possible to go back and consult the original if needed, and keep a record of what has been done to the data.
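Some of the cleansing steps described in this section can be sketched in Python using only the standard library. The example below assumes a small hypothetical dataset; the column names and the cleansing rules are illustrative. It leaves the raw text untouched, distinguishes a missing count from a genuine zero, standardises spelling, removes exact duplicates, and keeps a log of every change.

```python
import csv
import io

# Hypothetical raw data: one duplicate row, one missing value, one genuine zero.
RAW_CSV = """site,species,count
A,Robin,4
A,Robin,4
B,robin ,0
C,Wren,
"""


def cleanse(raw_text):
    """Return cleansed rows and a log of changes; the raw text is not modified."""
    rows, log, seen = [], [], set()
    reader = csv.DictReader(io.StringIO(raw_text))
    for line_no, row in enumerate(reader, start=2):  # line 1 is the header
        species = row["species"].strip().title()  # standardise spacing and case
        if species != row["species"]:
            log.append(f"line {line_no}: species {row['species']!r} -> {species!r}")
        # An empty field is missing (None), which is not the same as zero.
        count = int(row["count"]) if row["count"].strip() else None
        key = (row["site"], species, count)
        if key in seen:
            log.append(f"line {line_no}: duplicate row removed")
            continue
        seen.add(key)
        rows.append({"site": row["site"], "species": species, "count": count})
    return rows, log


clean_rows, change_log = cleanse(RAW_CSV)
for entry in change_log:
    print(entry)
```

Whether identical rows are true duplicates or legitimate repeat observations depends on the dataset, so a rule of this kind should be checked against the collection protocol before it is applied.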

Pseudonymisation

If your data includes personal identifiers, consider whether data should be pseudonymised. Pseudonymisation involves replacing personal identifiers (such as names or email addresses) with artificial identifiers (such as ID numbers). A version of the data containing the personal identifiers may be retained, allowing the data subjects to be reidentified if necessary, but day-to-day analysis is carried out using the pseudonymised version. This technique can be used to limit the number of people who have access to the full dataset, and also reduces the risk of accidental disclosure of personal information.
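A minimal sketch of this technique in Python is given below, assuming records held as a list of dictionaries. The function name, the participant_id field, and the sequential ID format are hypothetical choices made for illustration.

```python
def pseudonymise(records, identifier_field):
    """Replace a personal identifier with an artificial ID.

    Returns the pseudonymised records and a key table linking each
    identifier to its ID. The key table allows reidentification, so it
    should be stored separately from the working data, with restricted access.
    """
    key_table = {}  # personal identifier -> artificial ID
    pseudonymised = []
    for record in records:
        identifier = record[identifier_field]
        if identifier not in key_table:
            key_table[identifier] = f"P{len(key_table) + 1:04d}"
        # Copy every field except the personal identifier itself.
        cleaned = {k: v for k, v in record.items() if k != identifier_field}
        cleaned["participant_id"] = key_table[identifier]
        pseudonymised.append(cleaned)
    return pseudonymised, key_table


# Hypothetical interview scores; the names are placeholders.
responses = [
    {"name": "Alice Example", "score": 7},
    {"name": "Bob Example", "score": 5},
    {"name": "Alice Example", "score": 9},
]
working_data, key_table = pseudonymise(responses, "name")
print(working_data)
```

Sequential IDs reflect the order in which participants appear in the data; if that ordering could itself reveal something about individuals, randomly generated IDs may be preferable. Other fields may also need attention if, in combination, they could identify a person.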

Organising your data

Methods of organising or structuring data should be considered at an early stage in the project. This is a complex area, and the appropriate solution will depend on the type of data you are working with, and what you want to do with it.

It's worth taking some time to investigate the various tools and methods that are available, to ensure you're using what's best suited to your data. For example:

For straightforward, tabular data, a spreadsheet may provide all the functionality you need
For data with a more complex structure (particularly if multiple types of entity are involved), a relational database may be a better option
For semi-structured or unstructured data, a document-oriented database (using XML, for example) or a qualitative analysis package such as NVivo may be what's required
For more specific types of data, a specialised software package may be the best alternative

If you're uncertain which approach is most appropriate, or would like to learn more about the options, it may help to talk through your plans for your project with a more experienced colleague or with a member of the Research Data Oxford team. If you would like to make an appointment with the latter, email researchdata@ox.ac.uk.

Additionally, the IT Services Digital Capabilities Team offers a wide selection of courses which provide an introduction to a range of technologies and techniques.

File structure

Much time can be saved by having a file structure which is intuitive to navigate and which makes it easy to find files. If multiple people are using the same set of files, it may be worth documenting the structure that has been agreed, so everyone knows where to save new files and find existing ones.

When grouping files into hierarchical folders, it's usually best to aim for a balance between breadth and depth, so that no one category gets too big, but you also don't have to click through endless sub-folders to find a file.

It's good practice to review file structures regularly, to check whether they still meet the needs of the project. It may be helpful to periodically move older files that are no longer being worked on to a folder called 'Archive' (or whatever seems appropriate): this helps to keep the working environment uncluttered, while still allowing older files to be easily retrieved when needed.

If files seem to belong in multiple places within a folder structure, shortcuts can be used to allow the same file to be accessed from more than one place. (Individual team members may also wish to use shortcuts to create a personal folder of files they are currently working on.)

Both Windows and Mac now also permit tags to be added to files, which provides an alternative method of organisation, and allows related files from across the folder structure to be labelled and then quickly reidentified.

If you need to share a collection of files, or are creating a copy for deposit in an archive, the file and folder structure can be preserved by creating a zipped folder.
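Zipped folders can be created through the operating system's file manager, or with a script. The sketch below uses Python's standard shutil.make_archive function on a temporary, hypothetical project folder (the folder and file names are invented for the example), then lists the archive contents to confirm that the folder structure has been preserved.

```python
import shutil
import tempfile
import zipfile
from pathlib import Path

with tempfile.TemporaryDirectory() as tmp:
    # Build a small hypothetical project folder to archive.
    project = Path(tmp) / "project"
    (project / "raw").mkdir(parents=True)
    (project / "processed").mkdir()
    (project / "raw" / "interviews.csv").write_text("id,answer\n")
    (project / "processed" / "summary.csv").write_text("id,total\n")

    # Create project_deposit.zip containing the whole 'project' folder.
    archive_path = shutil.make_archive(
        str(Path(tmp) / "project_deposit"), "zip", root_dir=tmp, base_dir="project"
    )

    # List the files in the archive (skipping entries for the folders themselves).
    with zipfile.ZipFile(archive_path) as archive:
        archived_names = sorted(
            name for name in archive.namelist() if not name.endswith("/")
        )

print(archived_names)
```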

File and folder names

When using modern file storage systems such as OneDrive, which offer a rapid search facility, it is easy to neglect naming files in a logical way. However, it's still worth doing: you may want to share files, or move them to a different platform.

The ideal file or folder name is reasonably concise, but informative: it makes life easier if you can tell what’s in an item without having to open it.

Being consistent in your naming practices will also make it easier to identify the item you want. Within a research group, you may want to agree on file and folder naming conventions early on in the project. Document your decisions, and store a copy of the document somewhere accessible to all members of the team.

Operating systems usually default to sorting files alphabetically, so it can be helpful to think about which element comes at the start of the name – is it more useful to order the items by date, by author, or by subject, for example? If you're including a date, consider using international standard date notation (YYYY-MM-DD), to aid easy chronological sorting. To force a particular order on a set of files or folders, you can add a number at the start of the name.

You may also want to consider having a standard set of keywords or labels which are included in file names. For example, does the file contain raw or processed data? Is it a draft or a final version? Would a version number be helpful for keeping track of the most recent copy? If a file has been worked on by one particular member of the team, would it be useful to indicate this by including their initials, so people know who to ask about it?
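If file names are generated by a script, a naming convention can be applied automatically. The following Python sketch assumes one possible convention (date, subject, status, version, initials); the function name and the order of elements are illustrative, and a research group would substitute whatever it has agreed.

```python
from datetime import date


def build_file_name(created, subject, status, version, initials, extension):
    """Compose a file name in the form date_subject_status_vNN_initials.ext."""
    parts = [
        created.isoformat(),  # YYYY-MM-DD, so names sort chronologically
        subject.lower().replace(" ", "-"),
        status,  # e.g. 'raw' or 'processed'
        f"v{version:02d}",  # zero-padded so v02 sorts before v10
        initials.upper(),
    ]
    return "_".join(parts) + "." + extension


names = [
    build_file_name(date(2024, 11, 5), "soil samples", "raw", 1, "ab", "csv"),
    build_file_name(date(2024, 3, 18), "soil samples", "processed", 2, "cd", "csv"),
]
print(sorted(names))
```

Because the date comes first and uses the YYYY-MM-DD form, an ordinary alphabetical sort places the March file before the November one.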

File formats

When planning your project or writing a data management plan you should consider the best format to store the working data. The choice may be limited by the software you are using: some features of software packages may not be available unless you are using the native format of the application. For example, Microsoft Word for the web only supports live collaborative editing when using the .docx format.

In contrast, file formats for sharing your data should be non-proprietary if possible, allowing the widest range of software to open and work with the data stored within the file. For example, a comma-separated values (.csv) file can be opened with a much wider range of software than the proprietary Excel .xlsx format. If you are collaborating with other researchers, you should agree a common data format with them if possible. Using a non-proprietary format can also help ensure the data remains usable for as long as possible, as you aren't dependent on a particular piece of software (or even a particular version of the software) remaining available.

Where you have a choice in file format, here are some factors to bear in mind:

For traditional software installed on your device, consider:

Compatibility with the operating system your computer uses (e.g. Windows, macOS)
Formats which are practical to share or integrate with an approved cloud service like Nexus365

For cloud Software as a Service (SaaS):

When planning to use free versions of ‘Freemium’ services, be aware that the terms could change during the lifetime of your project: the cloud provider could cease offering that service entirely or could start charging a fee. If possible, avoid relying on free third party services
Check that the supplier running the service allows for an easy export of your data, ideally as a bulk download which preserves the structure of your data exactly

For both SaaS and installed software, consider:

Formats already used in your department or division or other projects similar to your research
Established discipline-specific standards
Formats used by colleagues (both academic and support staff) and local expertise in working with that type of data
The way you will be analysing, manipulating, and storing your data
A format which can be annotated with metadata so that you and your colleagues/supervisor can understand exactly what the information is and how it was gathered

Making research reproducible

Traditional research outputs, such as papers published in scholarly journals, will typically include details of the methods used. This allows for peer review and is established good practice.

Reproducible research builds on this principle: in addition to the paper, enhanced methods information and the underlying datasets are also made available. The goal is to allow others to easily reproduce the results of the research, and (with suitable permission and acknowledgement) to facilitate new work from the data.

Online repositories and code sharing sites make this easier to achieve. Researchers may choose to publish (for example) the R Markdown document or Jupyter notebook used to produce their results. This contributes to a fully reproducible method, from which consistent conclusions can be drawn.

The documents from the Academy of Medical Sciences 2015 Symposium give a good introduction to the issues that reproducible research is trying to address.

Reproducible Research Oxford (RROx)

Reproducible Research Oxford (RROx) is Oxford's local network within the UK Reproducibility Network. It works to promote a coordinated approach to open scholarship and research reproducibility in all disciplines.

Documentation and metadata

Documentation is simply the contextual information needed to aid proper interpretation of data. It can be thought of as a user's guide to a dataset.

Good documentation makes material understandable, verifiable, and reusable. While this is particularly important if data is going to be shared, it's also often valuable to the data creator, especially if some time has passed since the data was first gathered.

Metadata is data about data. The term is sometimes used interchangeably with 'documentation', but is often used to refer to a more structured collection of details about a dataset, which conforms to set standards (a metadata schema), and which may be designed to be machine readable. Catalogue records for datasets held in an archive are examples of this sort of metadata.

It is good practice to document data as it is collected and worked on. This is generally much easier (and in the long run, quicker) than trying to do it retrospectively, at the end of a project. Documentation procedures should be set out in a project's data management plan.

Documentation may be provided for a whole dataset, or for specific aspects of it. It should generally include:

Information about when, where, and by whom the dataset was created, and for what purpose
A description of the dataset
Details of methods used
Details of what has been done to the data - for example, has it been cleansed, edited, restructured, or otherwise manipulated, and if so, how?
Explanations of any acronyms, coding, or jargon
Any other notes which will help aid proper interpretation
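One lightweight way to keep track of these elements is to hold them in a simple structured record and check it for gaps as the project progresses. The Python sketch below is illustrative only: the field names are hypothetical and do not follow any particular metadata schema, and the example record is invented.

```python
# Hypothetical field names covering the documentation elements listed above.
REQUIRED_FIELDS = [
    "created_when", "created_where", "created_by", "purpose",
    "description", "methods", "processing", "glossary",
]


def missing_documentation(record):
    """Return the documentation fields that are absent or left empty."""
    return [field for field in REQUIRED_FIELDS if not record.get(field)]


# An invented, partly completed record for a placeholder dataset.
dataset_record = {
    "created_when": "2024-03-18",
    "created_where": "Example field site",
    "created_by": "A. Example",
    "purpose": "Pilot study",
    "description": "Soil moisture readings from twelve plots",
    "methods": "Handheld probe, three readings per plot",
    "processing": "",
}
print(missing_documentation(dataset_record))
```

A check of this kind only confirms that something has been written in each field; it cannot judge whether the documentation is clear or complete enough for someone else to interpret the data.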

The UK Data Service provides an excellent overview of this topic.

A wide range of metadata schemas exist for data from different disciplines: these aim to formalise the information needed to make a particular type of data as reusable as possible. The Digital Curation Centre has gathered information about this on their Disciplinary Metadata page, and FAIRsharing also offers an extensive catalogue of data and metadata standards.

Supplying appropriate metadata is an important part of making data FAIR - that is, Findable, Accessible, Interoperable, and Reusable. This topic is covered in more detail in the Sharing data section.

Electronic lab notebooks

There is a long tradition of keeping records of lab-based research in paper notebooks. Electronic lab notebooks (ELNs) offer a digital alternative to these, with a number of potential benefits: they are more easily searchable, and if the system is web-based, can be accessed from any location with internet access.

From a data management perspective, a key advantage of ELNs is that they allow research data to be stored alongside experimental records, and linked to them. They also make it easier to retrace the history of a dataset: details of when an item was created (and by whom), and of any edits subsequently made, can be captured automatically. This helps to make the research process more transparent, and aids reproducibility.

The University of Oxford has a subscription to the LabArchives ELN system. This is available free of charge to all researchers, including graduate students. LabArchives is a secure web-based system, and can be used to share material with collaborators within and outside the University.

Retention of research notebooks

Completed notebooks should be retained for a minimum of six years, and ideally longer. This ensures details of the research are available if needed to support findings, or to assess compliance with regulations. If a laboratory notebook contains details of an invention which has been patented, Oxford University Innovation (OUI) should be consulted before the notebook is disposed of, regardless of how much time has elapsed.

LabArchives is designed to provide a permanent record of research. If you are the owner of LabArchives notebooks and will be leaving the University, you should arrange for these to be transferred to your PI, supervisor, or another appropriate person. With agreement from your department, you will be able to export an electronic copy so that you can retain access to your notebooks after leaving Oxford.

Versioning

Versioning (also known as version control) is the process of keeping track of different versions of a file as it passes through the process of being revised. It serves two key purposes:

To ensure that it's always obvious which version is the most recent
To allow easy identification and retrieval of earlier versions where needed

There are various ways of doing this, and the important thing is to select a method that’s appropriate for the type of data you’re working with.

If your data is fairly straightforward, it may be sufficient simply to add a version number to the file name when a new version of the dataset is created. A common convention is to use whole numbers for major revisions, and decimals for minor ones: hence the initial version of a file might be called 'File v1', another copy saved after minor editing 'File v1.1', and a substantially revised version 'File v2'. Underscores or hyphens can be used in place of the decimal point if desired.
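The naming convention described above can be applied consistently with a small helper. The following is a minimal sketch under some assumptions not stated in the text: file names have an extension, the version is appended as `_v<major>` or `_v<major>.<minor>`, and the function name `next_version_name` is purely illustrative.

```python
import re
from pathlib import Path

# Matches stems such as "survey_v1" or "survey_v1.2" (assumed naming pattern).
VERSION_PATTERN = re.compile(r"^(?P<stem>.+)_v(?P<major>\d+)(?:\.(?P<minor>\d+))?$")


def next_version_name(filename: str, major_revision: bool = False) -> str:
    """Return the file name to use for the next saved version.

    Assumes the file name ends in an extension (for example ".csv").
    """
    path = Path(filename)
    match = VERSION_PATTERN.match(path.stem)
    if match is None:
        # No version in the name yet: treat this as the initial version.
        return f"{path.stem}_v1{path.suffix}"

    stem = match["stem"]
    major = int(match["major"])
    minor = int(match["minor"] or 0)
    if major_revision:
        # Whole numbers mark major revisions; the minor part is dropped.
        return f"{stem}_v{major + 1}{path.suffix}"
    # Decimals mark minor revisions.
    return f"{stem}_v{major}.{minor + 1}{path.suffix}"


print(next_version_name("survey.csv"))                            # survey_v1.csv
print(next_version_name("survey_v1.csv"))                         # survey_v1.1.csv
print(next_version_name("survey_v1.1.csv", major_revision=True))  # survey_v2.csv
```

The helper only computes a name; saving a copy under that name, and leaving earlier versions untouched, remains a manual step.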

For more complex projects – especially those with multiple collaborators – it may be advisable to use a system that has file management capability built into it. OneDrive for Business offers basic versioning. SharePoint Online has more sophisticated automated functionality, which can be combined with a check-out / check-in feature. This ensures that edits are made in a managed way (without the risk of one user accidentally overwriting another's changes), following a formal workflow if required.

The LabArchives electronic lab notebook service keeps detailed records of changes made, allowing the full history of files to be retraced. It is designed to be sufficiently robust to provide the evidence that may be needed to defend a patent or resolve a publishing dispute.

For those managing their own storage space, a range of specialist file management or document workflow software is available.

Research software management

Software and code have become an integral part of research in many disciplines, from astronomy and computational sciences to humanities. Research software is developed to analyse, model and simulate data, and is often developed by researchers with a specific research function in mind.

Research software is defined as any piece of code or script that enables researchers to process, manipulate, generate, and analyse data, or to automate or test these procedures.

Software and code created for research should be managed as a research output in its own right. It has taken time and resources to develop, and contributes to the overall reproducibility of your research. Managing it well can also increase the impact and reach of your research.

However, managing software and code can be complicated. There may be different versions, dependencies, and code libraries, so it is important to develop a software management plan that accounts for the development process and mitigates potential risks along the way. In addition, ensuring that your software or code can be preserved for the long term once your project ends can be challenging. To some degree, software can be managed in a similar way to research data, as a digital object. However, its executability, composite nature, and continuous evolution and versioning are additional aspects that need to be considered.

Like data, research software should be FAIR – Findable, Accessible, Interoperable, and Reusable. Some best practice guidance for software management can be found below, but you can contact the Research Data Oxford team or Reproducible Research Oxford for more specific guidance and support.

Reproducible workflow best practices for research software

Use a code versioning platform

A code versioning platform such as GitHub or GitLab should be used to host the code and to track changes and issues.

Provide clear documentation

A clear description of the software should be provided, together with the steps needed to install and use it, including any dependencies. The procedure for contributing to the code should also be stated, along with the contact details of the owner of the code.

State requirements, environment managers, and containers

The libraries and modules the software requires, their versions, and any other dependencies should be clearly indicated. Environment managers such as Conda and Pipenv, or containers such as Docker, should be used, and the choice documented so that others can recreate the same environment.
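One lightweight way to record exact versions is to query the running environment. The sketch below uses Python's standard `importlib.metadata` module; the helper names and the package names passed in at the end are illustrative assumptions, and it complements rather than replaces an environment manager or container.

```python
import platform
from importlib.metadata import PackageNotFoundError, version


def pinned_requirements(packages):
    """Return 'name==version' lines for the named packages.

    Packages that are not installed are listed as comments so that the
    gap is visible rather than silently skipped.
    """
    lines = []
    for name in packages:
        try:
            lines.append(f"{name}=={version(name)}")
        except PackageNotFoundError:
            lines.append(f"# {name} (not installed)")
    return lines


def environment_report(packages):
    """Combine the Python version, operating system, and pinned packages."""
    header = f"# Python {platform.python_version()} on {platform.system()}"
    return "\n".join([header, *pinned_requirements(packages)])


# Example package names only; substitute the ones your code imports.
print(environment_report(["numpy", "pandas"]))
```

The printed lines can be saved alongside the code (for example in a requirements file) so that the versions used for a given analysis are on record.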

Use clean code

Good coding practice should be used where possible, independent of the programming language. This ensures that others can understand your code and also easily find and fix bugs or other errors.

Test your code

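Write tests that check your code behaves as expected, and run them whenever the code changes. Testing cannot guarantee that code is free of errors, but it makes the code base more robust and supports the reproducibility of results produced with your research software.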
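As a small illustration of what such tests can look like, the sketch below defines a hypothetical analysis helper, `rescale`, together with two checks written as plain functions using `assert`. The function and its behaviour are invented for this example; test functions named `test_*` like these can also be collected automatically by a test runner such as pytest.

```python
def rescale(values):
    """Linearly rescale numbers so the minimum maps to 0 and the maximum to 1."""
    if not values:
        raise ValueError("values must not be empty")
    low, high = min(values), max(values)
    if low == high:
        raise ValueError("values must not all be equal")
    return [(v - low) / (high - low) for v in values]


def test_rescale_maps_range_to_unit_interval():
    # A known input with an easily checked expected result.
    assert rescale([2.0, 4.0, 6.0]) == [0.0, 0.5, 1.0]


def test_rescale_rejects_constant_input():
    # Invalid input should fail loudly rather than return misleading numbers.
    try:
        rescale([3.0, 3.0])
    except ValueError:
        return
    raise AssertionError("expected ValueError for constant input")


# Run the checks directly; a test runner would normally do this for you.
test_rescale_maps_range_to_unit_interval()
test_rescale_rejects_constant_input()
```

Checking both a normal case and an invalid input is a common starting point: the first guards against regressions, the second documents how the code is expected to fail.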

Apply a license

Applying an appropriate software license ensures that others know the conditions of access and reuse for your research software.

Include a citation file

Including a citation file ensures that others know how to cite your research software, and helps make sure you receive credit for it.
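The text above does not name a particular format, but one widely used convention is a `CITATION.cff` file (Citation File Format) placed at the top level of the code repository. The sketch below assembles a minimal file of that kind as text; the helper name `build_citation_cff` and all example values are hypothetical, and only a handful of the format's fields are shown.

```python
import json


def build_citation_cff(title, authors, version, date_released):
    """Return the text of a minimal CITATION.cff file.

    `authors` is a list of (given_names, family_names) tuples and
    `date_released` is a date string in YYYY-MM-DD form.
    """
    # json.dumps produces double-quoted strings, which are also valid YAML.
    lines = [
        "cff-version: 1.2.0",
        'message: "If you use this software, please cite it as below."',
        f"title: {json.dumps(title)}",
        f"version: {json.dumps(version)}",
        f"date-released: {date_released}",
        "authors:",
    ]
    for given, family in authors:
        lines.append(f"  - given-names: {json.dumps(given)}")
        lines.append(f"    family-names: {json.dumps(family)}")
    return "\n".join(lines) + "\n"


# Invented example values, for illustration only.
cff_text = build_citation_cff(
    title="Example analysis toolkit",
    authors=[("Ada", "Example")],
    version="0.1.0",
    date_released="2024-01-31",
)
print(cff_text)
```

The resulting text would be saved as `CITATION.cff`; it is worth checking the output against the Citation File Format documentation before relying on it.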
