Make your data count

Comparing the costs of data curation with the Curation Costs Exchange

This afternoon I had the pleasure of a trip to London to join colleagues from the Digital Curation Centre, the BBC, and various academic institutions to road test a recently-launched tool for comparing the costs of data curation. Built as part of the EU-funded ‘4C’ Project (‘Collaboration to Clarify the Costs of Curation’), the ‘Curation Costs Exchange’, or ‘CCEx’ as it is abbreviated, enables institutions to input information about their data curation costs in order to compare them against those of similar institutions. The idea is that such transparency aids institutions in keeping costs down, helps them budget for future services, and encourages collaboration and sharing. The website also explains the cost components of curation, provides a discussion forum, and offers a quick guide to curation services available elsewhere.

First impressions were very positive – the website has a slick, well-designed interface, but I was not entirely certain exactly what was being costed. Coming from a service-development background, my assumption was that the tool would help me get an idea of the costs of implementing curation services in general, but after a bit of poking about and clicking on things it became clear that the scope of the tool is actually very specifically focused on data ingest and preservation. The user indicates how much data they have, and in what formats, then inputs how much the various elements of the process cost in terms of staff time, hardware, software, overheads, etc. and then gets to compare these costs with other similar institutions. This it does very well, with nice graphical representations of one’s own costs in comparison with broader averages. I can see this being a useful tool for institutional repository managers (see the forthcoming ORA Data) wanting to check that they’re not spending over the odds on the services they provide.

There are some aspects in which the CCEx could be improved, as we discussed during the workshop. At present, one can indicate how many copies of each dataset are being preserved, but not whether they are being preserved on tape or spinning disk (which has an impact on storage and retrieval costs). Nor is it easy at present to distinguish between high-curation models (where librarians or archivists check and edit the descriptive metadata upon data submission) and low-curation models (where data and metadata are accepted ‘as is’). Obviously a repository employing the former model will have higher costs than one adopting the latter, but the difference is also likely to be reflected in future levels of data re-use. There were also questions about the extent to which researchers would need to be involved in the process to indicate the pre-ingest costs of managing any given data deposit, in order to avoid costs simply being pushed from one part of the institution to another. As we discovered during a survey of research database costs at Oxford back in 2011, researchers are not used to trying to capture such costs, many of which are covered by departments or concealed in general overheads.

Whilst the CCEx will be able to make more accurate comparisons between institutions once these issues are better addressed, it is already looking like a useful tool that will help drive down data repository costs over the long term.