The problem with metadata

An expert outlines the difficulties involved in data lifecycle management for research.

Text: Konrad O. Jaggi, published on 01.04.2014

Systems biology is an area of research currently receiving a great deal of government support in Switzerland. Still a young field, it is concerned with understanding and modelling entire biological systems – for example, how a microorganism such as yeast behaves in different environments or how a biological process such as plant growth works. This is complex fundamental research that hinges on collaboration between different research disciplines. SystemsX.ch is the national initiative in charge of promoting research projects in systems biology. The SyBIT project supports all SystemsX.ch projects in their research, some of which is highly data-intensive. One of its tasks is handling Data Lifecycle Management (DLCM) for SystemsX.ch.

SWITCH Journal: Who is responsible for the data?
Peter Kunszt: That varies a lot. As far as gene sequencing, mass spectrometry and microscopy are concerned, for instance, the Federal Institute of Technology and the universities have core facilities providing professional measurement services under controlled conditions. Active data management depends very much on the project in question. When data are produced at a core facility, there are precisely defined procedures that are subject to constant refinement. These are less well defined in the individual laboratories. SyBIT provides a range of software that makes life easier for both the core facilities and the individual laboratories. 

What sort of data volumes are being produced?
The biggest data producers generate as much as several dozen terabytes a year. Smaller microscopes and simple cell counters, meanwhile, produce volumes in the 100-gigabyte range. 

Does your institution have a policy regarding DLCM?
In fundamental research, people often want to keep all of their data because they are not entirely sure what may be hidden among them. With types of data where the content is already more familiar, people know what they can delete. Unfortunately, this only applies to a minority of our data. However, we are steadily developing a better understanding of when it is simpler to repeat a measurement rather than store the data. New measurements are getting quicker and more precise all the time. In SyBIT, we are working on practicable guidelines and standards to roll out within SystemsX.ch. 

Who do the data belong to?
The taxpayer – they are public data, and the copyrights are held by the cantonal universities and the Federal Institute of Technology.

How do you intend to ensure that data remain accessible far into the future?
We do not have a solution for that yet. If you are lucky, there will be an international repository for your data. However, we have seen cases where such repositories lost their funding and had to shut up shop. Either that, or they were privatised and will now only provide access to data against payment. It is unfortunate that there is no Swiss archive for research data. Technology is not the problem. It is much more important to determine who should be responsible for it in Switzerland. In principle, this is a political decision because it concerns the long-term funding of one or more national research data archives.

Where do you see requirements and challenges increasing in the coming years?
Volumes will continue to grow. Eliminating duplication and improving compression processes will help to keep data volumes down, but we need to arrive at a better understanding of how to employ these new technologies as effectively as possible. Problems with access and copyright only really occur in medicine. Data will become ever more complex. Even now, it is not the raw data that cause us problems, it is correct indexing and metadata. There is a lot that cannot be automated here, and structuring data requires highly specialised expertise.

How important is coordinated DLCM in your field?
It is hugely important, but a lot of people still fail to realise this. All new methods derive from previous findings and data. These must be made available and understandable – not just for researchers, but for teachers too. The first steps towards providing research with better DLCM support have already been taken. The SyBIT project is one of them. The important thing now is to follow these through and ensure that they become established in a lasting way. 

This article appeared in the SWITCH Journal October 2013.
About the author

Konrad O. Jaggi

After studying in Zurich and Aberdeen (UK), Konrad O. Jaggi managed several IT and information departments as well as a number of strategic planning projects. He has been in charge of Researchers & Lecturers at SWITCH since October 2011.

Peter Kunszt

After graduating in theoretical physics, Dr Peter Kunszt worked at CERN and managed several projects, including major EU grid projects. He is now in charge of the SyBIT project as part of SystemsX.ch, the Swiss initiative for systems biology, at the Federal Institute of Technology in Zurich.
