What to do with the Paul Scherrer Institute's research data? The head of the IT department explains.
The Paul Scherrer Institute (PSI) in Villigen in the canton of Aargau is Switzerland's largest research centre for natural sciences and engineering. It focuses on three main subject areas: Matter and Material, Energy and the Environment, and Human Health. The PSI develops, constructs and operates complex, large-scale research facilities that more than 2,000 scientists from all over the world visit every year to conduct their experiments. These facilities produce vast quantities of data.
Konrad O. Jaggi: Who is responsible for the data?
Stephan Egli: The data are created primarily by the experiments. The main producers are the photon beams of the Swiss Light Source – a major experimental facility – and the particle physics lab, which carries out fundamental research into rare types of decay. As yet, there is no one with the official title of Data Manager. Main responsibility for the data therefore lies with the research teams. Each group currently has to decide for itself whether its data are to be exported to its home institution and, if so, how that is to be done.
What sort of data volumes are being produced?
In 2012, the experiments produced data on tape totalling some 250 terabytes (1 terabyte = 1,000^4 bytes). The total quantity stored at present is 1.6 petabytes (1 petabyte = 1,000^5 bytes). However, not all data are systematically saved to tape. A large quantity currently has to be compressed and exported, as there is not enough local storage capacity to keep data for longer periods. These data have to be deleted after just a few weeks or months.
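As a rough illustration of the figures quoted above, the following Python sketch converts them to bytes using the decimal definitions given in the answer. The variable names are illustrative only, and the ratio is simple arithmetic on the two quoted numbers, not a figure supplied by the PSI.

```python
# Decimal storage units as defined in the interview.
TERABYTE = 1000 ** 4  # bytes
PETABYTE = 1000 ** 5  # bytes

# Figures quoted in the interview (variable names are illustrative).
produced_2012_bytes = 250 * TERABYTE       # data written to tape in 2012
total_stored_bytes = 16 * PETABYTE // 10   # 1.6 petabytes, kept as an integer

# How many times larger the total archive is than the 2012 tape output.
ratio = total_stored_bytes / produced_2012_bytes

print(f"2012 tape output: {produced_2012_bytes:.2e} bytes")
print(f"Total stored:     {total_stored_bytes:.2e} bytes")
print(f"Total stored is {ratio:.1f} times the 2012 tape output")
```

On these numbers, the total stored is a little over six times the amount written to tape in 2012.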
Does your institute have a policy regarding data lifecycle management (DLCM)?
We do not have a policy yet, but talks are under way to establish one based on international proposals, mainly those worked out as part of the EU FP7 projects. These include the PaNdata Europe and PaNdata ODI projects in particular.
Who do the data belong to?
As a rule, they belong to the researchers who produce them. The Paul Scherrer Institute makes no claim to intellectual property rights vis-à-vis external users of its research facilities, provided their results are published. Separate contractual agreements apply with regard to handling data from collaborations with industry.
How do you intend to ensure that data remain accessible far into the future?
Tape media will continue to be used for long-term storage, not least because this is the most energy-efficient solution. A lot of work still has to be done to standardise data formats. HDF5 is an important data format that supports annotation using metadata. I see standardisation as more of a long-term process. On the IT side, infrastructure must be in place that enables the efficient migration of data to the latest media and technologies.
Where do you see requirements and challenges increasing in the coming years?
IT departments need to provide infrastructure that makes researchers' lives easier and frees them from repetitive tasks. Researchers, for their part, must help to define quality criteria and continue to develop data formats and metadata. The science communities and their members must be involved in the DLCM process. Another key challenge I see is coping with the exponential growth in data volumes and with people's expectation that they can easily access and exchange data at any time, wherever they are. On top of all this, solutions must be found that are cost-optimised. Considerable pressure will therefore arise to find synergies within the Swiss academic community.
How important is coordinated DLCM in your field?
All parties involved should view DLCM as an integral part of the research process. I believe that it is a strategic necessity for any research operation. It should be set out clearly in a data policy document.